Lecture Notes in Computer Science Commenced Publication in 1973 Founding and Former Series Editors: Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen
Editorial Board
David Hutchison, Lancaster University, UK
Takeo Kanade, Carnegie Mellon University, Pittsburgh, PA, USA
Josef Kittler, University of Surrey, Guildford, UK
Jon M. Kleinberg, Cornell University, Ithaca, NY, USA
Friedemann Mattern, ETH Zurich, Switzerland
John C. Mitchell, Stanford University, CA, USA
Moni Naor, Weizmann Institute of Science, Rehovot, Israel
Oscar Nierstrasz, University of Bern, Switzerland
C. Pandu Rangan, Indian Institute of Technology, Madras, India
Bernhard Steffen, University of Dortmund, Germany
Madhu Sudan, Massachusetts Institute of Technology, MA, USA
Demetri Terzopoulos, New York University, NY, USA
Doug Tygar, University of California, Berkeley, CA, USA
Moshe Y. Vardi, Rice University, Houston, TX, USA
Gerhard Weikum, Max-Planck Institute of Computer Science, Saarbruecken, Germany
3212
Aurélio Campilho Mohamed Kamel (Eds.)
Image Analysis and Recognition International Conference, ICIAR 2004 Porto, Portugal, September 29 - October 1, 2004 Proceedings, Part II
Springer
eBook ISBN: 3-540-30126-7
Print ISBN: 3-540-23240-0
©2005 Springer Science + Business Media, Inc. Print ©2004 Springer-Verlag Berlin Heidelberg. All rights reserved. No part of this eBook may be reproduced or transmitted in any form or by any means, electronic, mechanical, recording, or otherwise, without written consent from the Publisher. Created in the United States of America.
Visit Springer's eBookstore at: http://ebooks.springerlink.com
and the Springer Global Website Online at: http://www.springeronline.com
Preface
ICIAR 2004, the International Conference on Image Analysis and Recognition, was the first ICIAR conference, and was held in Porto, Portugal. ICIAR will be organized annually, and will alternate between Europe and North America. ICIAR 2005 will take place in Toronto, Ontario, Canada. The idea of offering these conferences came as a result of discussion between researchers in Portugal and Canada to encourage collaboration and exchange, mainly between these two countries, but also with the open participation of other countries, addressing recent advances in theory, methodology and applications.

The response to the call for papers for ICIAR 2004 was very positive. From 316 full papers submitted, 210 were accepted (97 oral presentations and 113 posters). The review process was carried out by the Program Committee members and other reviewers; all are experts in various image analysis and recognition areas. Each paper was reviewed by at least two reviewing parties. The high quality of the papers in these proceedings is attributed first to the authors, and second to the quality of the reviews provided by the experts. We would like to thank the authors for responding to our call, and we wholeheartedly thank the reviewers for their excellent work in such a short amount of time. We are especially indebted to the Program Committee for their efforts that allowed us to set up this publication.

We were very pleased to be able to include in the conference Prof. Murat Kunt, from the Swiss Federal Institute of Technology, and Prof. Mário Figueiredo, of the Instituto Superior Técnico, in Portugal. These two world-renowned experts were a great addition to the conference and we would like to express our sincere gratitude to each of them for accepting our invitations.

We would also like to thank Prof. Ana Maria Mendonça and Prof. Luís Corte-Real for all their help in organizing this meeting; Khaled Hammouda, the webmaster of the conference, for maintaining the Web pages, interacting with authors and preparing the proceedings; and Gabriela Afonso, for her administrative assistance. We also appreciate the help of the editorial staff from Springer for supporting this publication in the LNCS series.

Finally, we were very pleased to welcome all the participants to this conference. For those who did not attend, we hope this publication provides a brief view into the research presented at the conference, and we look forward to meeting you at the next ICIAR conference, to be held in Toronto, 2005.
September 2004
Aurélio Campilho, Mohamed Kamel
ICIAR 2004 – International Conference on Image Analysis and Recognition
General Chair Aurélio Campilho University of Porto, Portugal
[email protected]
General Co-chair Mohamed Kamel University of Waterloo, Canada
[email protected]
Local Chairs Ana Maria Mendonça University of Porto, Portugal
[email protected]
Luís Corte-Real University of Porto, Portugal
[email protected]
Webmaster Khaled Hammouda University of Waterloo, Canada
[email protected]
Supported by
Department of Electrical and Computer Engineering, Faculty of Engineering, University of Porto, Portugal
INEB – Instituto de Engenharia Biomédica
Pattern Analysis and Machine Intelligence Group, University of Waterloo, Canada
Advisory and Program Committee
M. Ahmadi – University of Windsor, Canada
M. Ahmed – Wilfrid Laurier University, Canada
A. Amin – University of New South Wales, Australia
O. Basir – University of Waterloo, Canada
J. Bioucas – Technical University of Lisbon, Portugal
M. Cheriet – University of Quebec, Canada
D. Clausi – University of Waterloo, Canada
L. Corte-Real – University of Porto, Portugal
M. El-Sakka – University of Western Ontario, Canada
P. Fieguth – University of Waterloo, Canada
M. Ferretti – University of Pavia, Italy
M. Figueiredo – Technical University of Lisbon, Portugal
A. Fred – Technical University of Lisbon, Portugal
L. Guan – Ryerson University, Canada
E. Hancock – University of York, UK
M. Kunt – Swiss Federal Institute of Technology, Switzerland
E. Jernigan – University of Waterloo, Canada
J. Marques – Technical University of Lisbon, Portugal
A. Mendonça – University of Porto, Portugal
A. Padilha – University of Porto, Portugal
F. Perales – University of the Balearic Islands, Spain
F. Pereira – Technical University of Lisbon, Portugal
A. Pinho – University of Aveiro, Portugal
N. Peres de la Blanca – University of Granada, Spain
P. Pina – Technical University of Lisbon, Portugal
F. Pla – University of Jaume I, Spain
K. Plataniotis – University of Toronto, Canada
T. Rabie – University of Toronto, Canada
P. Scheunders – University of Antwerp, Belgium
M. Sid-Ahmed – University of Windsor, Canada
W. Skarbek – Warsaw University of Technology, Poland
H. Tizhoosh – University of Waterloo, Canada
D. Vandermeulen – Catholic University of Leuven, Belgium
M. Vento – University of Salerno, Italy
R. Ward – University of British Columbia, Canada
D. Zhang – Hong Kong Polytechnic, Hong Kong
Reviewers
M. Abasolo – University of the Balearic Islands, Spain
A. Adegorite – University of Waterloo, Canada
N. Alajlan – University of Waterloo, Canada
H. Araújo – University of Coimbra, Portugal
B. Ávila – Universidade Federal de Pernambuco, Brazil
Z. Azimifar – University of Waterloo, Canada
O. Badawy – University of Waterloo, Canada
J. Batista – University of Coimbra, Portugal
A. Buchowicz – Warsaw University of Technology, Poland
J. Caeiro – Beja Polytechnical Institute, Portugal
L. Chen – University of Waterloo, Canada
G. Corkidi – National University of Mexico, Mexico
M. Correia – University of Porto, Portugal
J. Costeira – Technical University of Lisbon, Portugal
R. Dara – University of Waterloo, Canada
A. Dawoud – University of South Alabama, USA
H. du Buf – University of the Algarve, Portugal
I. El Rube – University of Waterloo, Canada
L. Guan – Ryerson University, Canada
M. Hidalgo – University of the Balearic Islands, Spain
J. Jiang – University of Waterloo, Canada
J. Jorge – Technical University of Lisbon, Portugal
A. Kong – University of Waterloo, Canada
M. Koprnicky – University of Waterloo, Canada
R. Lins – Universidade Federal de Pernambuco, Brazil
W. Mageed – University of Maryland, USA
B. Miners – University of Waterloo, Canada
A. Monteiro – University of Porto, Portugal
J. Orchard – University of Waterloo, Canada
M. Piedade – Technical University of Lisbon, Portugal
J. Pinto – Technical University of Lisbon, Portugal
M. Portells – University of the Balearic Islands, Spain
A. Puga – University of Porto, Portugal
W. Rakowski – Bialystok Technical University, Poland
B. Santos – University of Aveiro, Portugal
J. Santos-Victor – Technical University of Lisbon, Portugal
G. Schaefer – Nottingham Trent University, UK
J. Sequeira – Laboratoire LSIS (UMR CNRS 6168), France
J. Silva – University of Porto, Portugal
J. Sousa – Technical University of Lisbon, Portugal
L. Sousa – Technical University of Lisbon, Portugal
X. Varona – University of the Balearic Islands, Spain
E. Vrscay – University of Waterloo, Canada
S. Wesolkowski – University of Waterloo, Canada
L. Winger – LSI Logic Canada Corporation, Canada
Table of Contents – Part II
Biomedical Applications An Automated Multichannel Procedure for cDNA Microarray Image Processing Rastislav Lukac, Konstantinos N. Plataniotis, Bogdan Smolka, Anastasios N. Venetsanopoulos A Modified Nearest Neighbor Method for Image Reconstruction in Fluorescence Microscopy Koji Yano, Itsuo Kumazawa
1
9
An Improved Clustering-Based Approach for DNA Microarray Image Segmentation Luis Rueda, Li Qin
17
A Spatially Adaptive Filter Reducing Arc Stripe Noise for Sector Scan Medical Ultrasound Imaging Qianren Xu, M. Kamel, M.M.A. Salama
25
Fuzzy-Snake Segmentation of Anatomical Structures Applied to CT Images Gloria Bueno, Antonio Martínez-Albalá, Antonio Adán
33
Topological Active Volumes for Segmentation and Shape Reconstruction of Medical Images N. Barreira, M.G. Penedo
43
Region of Interest Based Prostate Tissue Characterization Using Least Square Support Vector Machine LS-SVM S.S. Mohamed, M.M.A. Salama, M. Kamel, K. Rizkalla
51
Ribcage Boundary Delineation in Chest X-ray Images Carlos Vinhais, Aurélio Campilho A Level-Set Based Volumetric CT Segmentation Technique: A Case Study with Pulmonary Air Bubbles José Silvestre Silva, Beatriz Sousa Santos, Augusto Silva, Joaquim Madeira Robust Fitting of a Point Distribution Model of the Prostate Using Genetic Algorithms Fernando Arámbula Cosío
59
68
76
A Quantification Tool to Analyse Stained Cell Cultures E. Glory, A. Faure, V. Meas-Yedid, F. Cloppet, Ch. Pinset, G. Stamon, J-Ch. Olivo-Marin
Dynamic Pedobarography Transitional Objects by Lagrange’s Equation with FEM, Modal Matching, and Optimization Techniques Raquel Ramos Pinho, João Manuel R.S. Tavares
84
92
3D Meshes Registration: Application to Statistical Skull Model M. Berar, M. Desvignes, G. Bailly, Y. Payan
100
Detection of Rib Borders on X-ray Chest Radiographs Rui Moreira, Ana Maria Mendonça, Aurélio Campilho
108
Isosurface-Based Level Set Framework for MRA Segmentation Yongqiang Zhao, Minglu Li
116
Segmentation of the Comet Assay Images Bogdan Smolka, Rastislav Lukac
124
Automatic Extraction of the Retina AV Index I.G. Caderno, M.G. Penedo, C. Mariño, M.J. Carreira, F. Gomez-Ulla, F. González
132
Image Registration in Electron Microscopy. A Stochastic Optimization Approach J.L. Redondo, P.M. Ortigosa, I. García, J.J. Fernández Evolutionary Active Contours for Muscle Recognition A. Caro, P.G. Rodríguez, M.L. Durán, J.A. Ávila, T. Antequera, R. Palacios Automatic Lane and Band Detection in Images of Thin Layer Chromatography António V. Sousa, Rui Aguiar, Ana Maria Mendonça, Aurélio Campilho Automatic Tracking of Arabidopsis thaliana Root Meristem in Confocal Microscopy Bernardo Garcia, Ana Campilho, Ben Scheres, Aurélio Campilho
141 150
158
166
Document Processing A New File Format for Decorative Tiles Rafael Dueire Lins
175
Projection Profile Based Algorithm for Slant Removal Moisés Pastor, Alejandro Toselli, Enrique Vidal
183
Novel Adaptive Filtering for Salt-and-Pepper Noise Removal from Binary Document Images Amr R. Abdel-Dayem, Ali K. Hamou, Mahmoud R. El-Sakka
191
Automated Seeded Region Growing Method for Document Image Binarization Based on Topographic Features Yufei Sun, Yan Chen, Yuzhi Zhang, Yanxia Li
200
Image Segmentation of Historical Documents: Using a Quality Index Carlos A.B. de Mello A Complete System for Detection and Identification of Tabular Structures from Document Images S. Mandal, S.P. Chowdhury, A.K. Das, Bhabatosh Chanda
209
217
Underline Removal on Old Documents João R. Caldas Pinto, Pedro Pina, Lourenço Bandeira, Luís Pimentel, Mário Ramalho
226
A New Algorithm for Skew Detection in Images of Documents Rafael Dueire Lins, Bruno Tenório Ávila
234
Blind Source Separation Techniques for Detecting Hidden Texts and Textures in Document Images Anna Tonazzini, Emanuele Salerno, Matteo Mochi, Luigi Bedini Efficient Removal of Noisy Borders from Monochromatic Documents Bruno Tenório Ávila, Rafael Dueire Lins
241 249
Colour Analysis Robust Dichromatic Colour Constancy Gerald Schaefer
257
Soccer Field Detection in Video Images Using Color and Spatial Coherence Arnaud Le Troter, Sebastien Mavromatis, Jean Sequeira
265
New Methods to Produce High Quality Color Anaglyphs for 3-D Visualization Ianir Ideses, Leonid Yaroslavsky
273
A New Color Filter Array Interpolation Approach for Single-Sensor Imaging Rastislav Lukac, Konstantinos N. Plataniotis, Bogdan Smolka
281
A Combinatorial Color Edge Detector Soufiane Rital, Hocine Cherifi
289
Texture Analysis A Fast Probabilistic Bidirectional Texture Function Model Michal Haindl,
298
Model-Based Texture Segmentation Michal Haindl, Stanislav Mikeš
306
A New Gabor Filter Based Kernel for Texture Classification with SVM Mahdi Sabri, Paul Fieguth
314
Grading Textured Surfaces with Automated Soft Clustering in a Supervised SOM J. Martín-Herrero, M. Ferreiro-Armán, J.L. Alba-Castro
323
Textures and Wavelet-Domain Joint Statistics Zohreh Azimifar, Paul Fieguth, Ed Jernigan
331
Video Segmentation Through Multiscale Texture Analysis Miguel Alemán-Flores, Luis Álvarez-León
339
Motion Analysis Estimation of Common Groundplane Based on Co-motion Statistics Zoltan Szlavik, Laszlo Havasi, Tamas Sziranyi An Adaptive Estimation Method for Rigid Motion Parameters of 2D Curves Turker Sahin, Mustafa Unel
347
355
Classifiers Combination for Improved Motion Segmentation Ahmad Al-Mazeed, Mark Nixon, Steve Gunn
363
A Pipelined Real-Time Optical Flow Algorithm Miguel V. Correia, Aurélio Campilho
372
De-interlacing Algorithm Based on Motion Objects Junxia Gu, Xinbo Gao, Jie Li
381
Automatic Selection of Training Samples for Multitemporal Image Classification T.B. Cazes, R.Q. Feitosa, G.L.A. Mota
389
Parallel Computation of Optical Flow Antonio G. Dopico, Miguel V. Correia, Jorge A. Santos, Luis M. Nunes
397
Lipreading Using Recurrent Neural Prediction Model Takuya Tsunekawa, Kazuhiro Hotta, Haruhisa Takahashi
405
Multi-model Adaptive Estimation for Nonuniformity Correction of Infrared Image Sequences Jorge E. Pezoa, Sergio N. Torres
413
Surveillance and Remote Sensing
A MRF Based Segmentation Approach to Classification Using Dempster Shafer Fusion for Multisensor Imagery A. Sarkar, N. Banerjee, P. Nair, A. Banerjee, S. Brahma, B. Kartikeyan, K.L. Majumder
421
429
A Change-Detection Algorithm Enabling Intelligent Background Maintenance Luigi Di Stefano, Stefano Mattoccia, Martino Mola
437
Dimension Reduction and Pre-emphasis for Compression of Hyperspectral Images C. Lee, E. Choi, J. Choe, T. Jeong
446
Viewpoint Independent Detection of Vehicle Trajectories and Lane Geometry from Uncalibrated Traffic Surveillance Cameras José Melo, Andrew Naftel, Alexandre Bernardino, José Santos-Victor
454
Robust Tracking and Object Classification Towards Automated Video Surveillance Jose-Luis Landabaso, Li-Qun Xu, Montse Pardas
463
Detection of Vehicles in a Motorway Environment by Means of Telemetric and Visual Data Sonia Izri, Eric Brassart, Laurent Delahoche, Bruno Marhic, Arnaud Clérentin High Quality-Speed Dilemma: A Comparison Between Segmentation Methods for Traffic Monitoring Applications Alessandro Bevilacqua, Luigi Di Stefano, Alessandro Lanza Automatic Recognition of Impact Craters on the Surface of Mars Teresa Barata, E. Ivo Alves, José Saraiva, Pedro Pina Classification of Dune Vegetation from Remotely Sensed Hyperspectral Images Steve De Backer, Pieter Kempeneers, Walter Debruyn, Paul Scheunders
471
481 489
497
SAR Image Classification Based on Immune Clonal Feature Selection Xiangrong Zhang, Tan Shan, Licheng Jiao
504
Depth Extraction System Using Stereo Pairs Rizwan Ghaffar, Noman Jafri, Shoab Ahmed Khan
512
Fast Moving Region Detection Scheme in Ad Hoc Sensor Network Yazhou Liu, Wen Gao, Hongxun Yao, Shaohui Liu, Lijun Wang
520
Tracking
LOD Canny Edge Based Boundary Edge Selection for Human Body Tracking Jihun Park, Tae-Yong Kim, Sunghun Park
528
Object Boundary Edge Selection for Accurate Contour Tracking Using Multi-level Canny Edges Tae-Yong Kim, Jihun Park, Seong-Whan Lee
536
Reliable Dual-Band Based Contour Detection: A Double Dynamic Programming Approach Mohammad Dawood, Xiaoyi Jiang, Klaus P. Schäfers
544
Tracking Pedestrians Under Occlusion Using Multiple Cameras Jorge P. Batista
552
Application of Radon Transform to Lane Boundaries Tracking R. Nourine, M. Elarbi Boudihir, S.F. Khelifi A Speaker Tracking Algorithm Based on Audio and Visual Information Fusion Using Particle Filter Xin Li, Luo Sun, Linmi Tao, Guangyou Xu, Ying Jia
563
572
Kernel-Bandwidth Adaptation for Tracking Object Changing in Size Ning-Song Peng, Jie Yang, Jia-Xin Chen
581
Tracking Algorithms Evaluation in Feature Points Image Sequences Vanessa Robles, Enrique Alegre, Jose M. Sebastian
589
Short-Term Memory-Based Object Tracking Hang-Bong Kang, Sang-Hyun Cho
597
Real Time Multiple Object Tracking Based on Active Contours Sébastien Lefèvre, Nicole Vincent
606
An Object Tracking Algorithm Combining Different Cost Functions D. Conte, P. Foggia, C. Guidobaldi, A. Limongiello, M. Vento
614
Vehicle Tracking at Traffic Scene with Modified RLS Hadi Sadoghi Yazdi, Mahmood Fathy, A. Mojtaba Lotfizad
623
Face Detection and Recognition Understanding In-Plane Face Rotations Using Integral Projections Henry Nicponski
633
Feature Fusion Based Face Recognition Using EFM Dake Zhou, Xin Yang
643
Real-Time Facial Feature Extraction by Cascaded Parameter Prediction and Image Optimization Fei Zuo, Peter H.N. de With Frontal Face Authentication Through Creaseness-Driven Gabor Jets Daniel González-Jiménez, José Luis Alba-Castro
651 660
A Coarse-to-Fine Classification Scheme for Facial Expression Recognition Xiaoyi Feng, Abdenour Hadid, Matti Pietikäinen
668
Fast Face Detection Using QuadTree Based Color Analysis and Support Vector Verification Shu-Fai Wong, Kwan-Yee Kenneth Wong
676
Three-Dimensional Face Recognition: A Fishersurface Approach Thomas Heseltine, Nick Pears, Jim Austin
684
Face Recognition Using Improved-LDA Dake Zhou, Xin Yang
692
Analysis and Recognition of Facial Expression Based on Point-Wise Motion Energy Hanhoon Park, Jong-Il Park Face Class Modeling Using Mixture of SVMs Julien Meynet, Vlad Popovici, Jean-Philippe Thiran Comparing Robustness of Two-Dimensional PCA and Eigenfaces for Face Recognition Muriel Visani, Christophe Garcia, Christophe Laurent
700 709
717
Useful Computer Vision Techniques for Human-Robot Interaction O. Deniz, A. Falcon, J. Mendez, M. Castrillon
725
Face Recognition with Generalized Entropy Measurements Yang Li, Edwin R. Hancock
733
Facial Feature Extraction and Principal Component Analysis for Face Detection in Color Images Saman Cooray, Noel O’Connor
741
Security Systems Fingerprint Enhancement Using Circular Gabor Filter En Zhu, Jianping Yin, Guomin Zhang A Secure and Localizing Watermarking Technique for Image Authentication Abdelkader H. Ouda, Mahmoud R. El-Sakka A Hardware Implementation of Fingerprint Verification for Secure Biometric Authentication Systems Yongwha Chung, Daesung Moon, Sung Bum Pan, Min Kim, Kichul Kim Inter-frame Differential Energy Video Watermarking Algorithm Based on Compressed Domain Lijun Wang, Hongxun Yao, Shaohui Liu, Wen Gao, Yazhou Liu Improving DTW for Online Handwritten Signature Verification M. Wirotius, J. Y. Ramel, N. Vincent Distribution of Watermark According to Image Complexity for Higher Stability Mansour Jamzad, Farzin Yaghmaee
750
759
770
778 786
794
Visual Inspection Comparison of Intelligent Classification Techniques Applied to Marble Classification João M. C. Sousa, João R. Caldas Pinto Inspecting Colour Tonality on Textured Surfaces Xianghua Xie, Majid Mirmehdi, Barry Thomas
802 810
Automated Visual Inspection of Glass Bottles Using Adapted Median Filtering Domingo Mery, Olaya Medina
818
Neuro-Fuzzy Method for Automated Defect Detection in Aluminium Castings Sergio Hernández, Doris Sáez, Domingo Mery
826
Online Sauter Diameter Measurement of Air Bubbles and Oil Drops in Stirred Bioreactors by Using Hough Transform L. Vega-Alvarado, M.S. Cordova, B. Taboada, E. Galindo, G. Corkidi
834
Defect Detection in Textile Images Using Gabor Filters Céu L. Beirão, Mário A.T. Figueiredo
841
Geometric Surface Inspection of Raw Milled Steel Blocks Ingo Reindl, Paul O’Leary
849
Author Index
857
Table of Contents – Part I
Image Segmentation Automatic Image Segmentation Using a Deformable Model Based on Charged Particles Andrei C. Jalba, Michael H.F. Wilkinson, Jos B.T.M. Roerdink Hierarchical Regions for Image Segmentation Slawo Wesolkowski, Paul Fieguth Efficiently Segmenting Images with Dominant Sets Massimiliano Pavan, Marcello Pelillo
1 9
17
Color Image Segmentation Using Energy Minimization on a Quadtree Representation Adolfo Martínez-Usó, Filiberto Pla, Pedro García-Sevilla
25
Segmentation Using Saturation Thresholding and Its Application in Content-Based Retrieval of Images A. Vadivel, M. Mohan, Shamik Sural, A.K. Majumdar
33
A New Approach to Unsupervised Image Segmentation Based on Wavelet-Domain Hidden Markov Tree Models Qiang Sun, Shuiping Gou, Licheng Jiao
41
Spatial Discriminant Function with Minimum Error Rate for Image Segmentation EunSang Bak
49
Detecting Foreground Components in Grey Level Images for Shift Invariant and Topology Preserving Pyramids Giuliana Ramella, Gabriella Sanniti di Baja
57
Pulling, Pushing, and Grouping for Image Segmentation Guoping Qiu, Kin-Man Lam
65
Image Segmentation by a Robust Clustering Algorithm Using Gaussian Estimator Lei Wang, Hongbing Ji, Xinbo Gao
74
A Multistage Image Segmentation and Denoising Method – Based on the Mumford and Shah Variational Approach Song Gao, Tien D. Bui
82
A Multiresolution Threshold Selection Method Based on Training J.R. Martinez-de Dios, A. Ollero Segmentation Based Environment Modeling Using a Single Image Seung Taek Ryoo Unsupervised Color-Texture Segmentation Yuzhong Wang, Jie Yang, Yue Zhou
90
98 106
Image Processing and Analysis Hierarchical MCMC Sampling Paul Fieguth
114
Registration and Fusion of Blurred Images Filip Sroubek, Jan Flusser
122
A New Numerical Scheme for Anisotropic Diffusion Hongwen Yi, Peter H. Gregson
130
An Effective Detail Preserving Filter for Impulse Noise Removal Naif Alajlan, Ed Jernigan
139
A Quantum-Inspired Genetic Algorithm for Multi-source Affine Image Registration Hichem Talbi, Mohamed Batouche, Amer Draa
147
Nonparametric Impulsive Noise Removal Bogdan Smolka, Rastislav Lukac
155
BayesShrink Ridgelets for Image Denoising Nezamoddin Nezamoddini-Kachouie, Paul Fieguth, Edward Jernigan
163
Image Salt-Pepper Noise Elimination by Detecting Edges and Isolated Noise Points Gang Li, Binheng Song
171
Image De-noising via Overlapping Wavelet Atoms V. Bruni, D. Vitulano
179
Gradient Pile Up Algorithm for Edge Enhancement and Detection Leticia Guimarães, André Soares, Viviane Cordeiro, Altamiro Susin
187
Co-histogram and Image Degradation Evaluation Pengwei Hao, Chao Zhang, Anrong Dang
195
MAP Signal Reconstruction with Non Regular Grids João M. Sanches, Jorge S. Marques
204
Comparative Frameworks for Directional Primitive Extraction M. Penas, M.J. Carreira, M.G. Penedo, M. Mirmehdi, B.T. Thomas
212
Dynamic Content Adaptive Super-Resolution Mei Chen
220
Efficient Classification Method for Autonomous Driving Application Pangyu Jeong, Sergiu Nedevschi
228
Image Analysis and Synthesis Parameterized Hierarchical Annealing for Scientific Models Simon K. Alexander, Paul Fieguth, Edward R. Vrscay
236
Significance Test for Feature Subset Selection on Image Recognition Qianren Xu, M. Kamel, M.M. A. Salama
244
Image Recognition Applied to Robot Control Using Fuzzy Modeling Paulo J. Sequeira Gonçalves, L.F. Mendonça, J.M.C. Sousa, J.R. Caldas Pinto
253
Large Display Interaction Using Video Avatar and Hand Gesture Recognition Sang Chul Ahn, Tae-Seong Lee, Ig-Jae Kim, Yong-Moo Kwon, Hyoung-Gon Kim
261
Image and Video Coding Optimal Transform in Perceptually Uniform Color Space and Its Application in Image Coding Ying Chen, Pengwei Hao, Anrong Dang
269
Lossless Compression of Color-Quantized Images Using Block-Based Palette Reordering António J.R. Neves, Armando J. Pinho
277
Fovea Based Coding for Video Streaming Reha Civanlar
285
Influence of Task and Scene Content on Subjective Video Quality Ying Zhong, Iain Richardson, Arash Sahraie, Peter McGeorge
295
Evaluation of Some Reordering Techniques for Image VQ Index Compression António R.C. Paiva, Armando J. Pinho
302
Adaptive Methods for Motion Characterization and Segmentation of MPEG Compressed Frame Sequences C. Doulaverakis, S. Vagionitis, M. Zervakis, E. Petrakis On the Automatic Creation of Customized Video Content José San Pedro, Nicolas Denis, Sergio Domínguez
310 318
Shape and Matching Graph Pattern Spaces from Laplacian Spectral Polynomials Bin Luo, Richard C. Wilson, Edwin R. Hancock A Hierarchical Framework for Shape Recognition Using Articulated Shape Mixtures Abdullah Al Shaher, Edwin R. Hancock
327
335
A New Affine Invariant Fitting Algorithm for Algebraic Curves Sait Sener, Mustafa Unel
344
Graph Matching Using Manifold Embedding Bai Xiao, Hang Yu, Edwin Hancock
352
A Matching Algorithm Based on Local Topologic Structure Xinjian Chen, Jie Tian, Xin Yang
360
2-D Shape Matching Using Asymmetric Wavelet-Based Dissimilarity Measure Ibrahim El Rube’, Mohamed Kamel, Maher Ahmed
368
A Real-Time Image Stabilization System Based on Fourier-Mellin Transform J.R. Martinez-de Dios, A. Ollero
376
A Novel Shape Descriptor Based on Interrelation Quadruplet Dongil Han, Bum-Jae You, Sang-Rok Oh
384
An Efficient Representation of Hand Sketch Graphic Messages Using Recursive Bezier Curve Approximation Jaehwa Park, Young-Bin Kwon
392
Contour Description Through Set Operations on Dynamic Reference Shapes Miroslav Koprnicky, Maher Ahmed, Mohamed Kamel
400
An Algorithm for Efficient and Exhaustive Template Matching Luigi Di Stefano, Stefano Mattoccia, Federico Tombari Modelling of Overlapping Circular Objects Based on Level Set Approach Eva Dejnozkova, Petr Dokladal
408
416
A Method for Dominant Points Detection and Matching 2D Object Identification A. Carmona-Poyato, N.L. Fernández-García, R. Medina-Carnicer, F.J. Madrid-Cuevas
424
Image Description and Recognition Character Recognition Using Canonical Invariants Sema Doguscu, Mustafa Unel
432
Finding Significant Points for a Handwritten Classification Task Juan Ramón Rico-Juan, Luisa Micó
440
The System for Handwritten Symbol and Signature Recognition Using FPGA Computing Rauf K. Sadykhov, Leonid P. Podenok, Vladimir A. Samokhval, Andrey A. Uvarov Reconstruction of Order Parameters Based on Immunity Clonal Strategy for Image Classification Xiuli Ma, Licheng Jiao Visual Object Recognition Through One-Class Learning QingHua Wang, Luís Seabra Lopes, David M.J. Tax Semantic Image Analysis Based on the Representation of the Spatial Relations Between Objects in Images Hyunjang Kong, Miyoung Cho, Kwanho Jung, Sunkyoung Baek, Pankoo Kim
447
455 463
471
Ridgelets Frame Tan Shan, Licheng Jiao, Xiangchu Feng
479
Adaptive Curved Feature Detection Based on Ridgelet Kang Liu, Licheng Jiao
487
Globally Stabilized 3L Curve Fitting Turker Sahin, Mustafa Unel
495
Learning an Information Theoretic Transform for Object Detection Jianzhong Fang, Guoping Qiu
503
Image Object Localization by AdaBoost Classifier Krzysztof Kucharski
511
Cost and Information-Driven Algorithm Selection for Vision Systems Mauricio Marengoni, Allen Hanson, Shlomo Zilberstein, Edward Riseman
519
Gesture Recognition for Human-Robot Interaction Through a Knowledge Based Software Platform M. Hasanuzzaman, Tao Zhang, V. Ampornaramveth, M.A. Bhuiyan, Yoshiaki Shirai, H. Ueno Appearance-Based Object Detection in Space-Variant Images: A Multi-model Approach V. Javier Traver, Alexandre Bernardino, Plinio Moreno, José Santos-Victor
530
538
3D Object Recognition from Appearance: PCA Versus ICA Approaches M. Asunción Vicente, Cesar Fernández, Oscar Reinoso, Luis Payá
547
A Stochastic Search Algorithm to Optimize an N-tuple Classifier by Selecting Its Inputs Hannan Bin Azhar, Keith Dimond
556
Video Processing and Analysis A Multi-expert Approach for Shot Classification in News Videos M. De Santo, G. Percannella, C. Sansone, M. Vento
564
Motion-Compensated Wavelet Video Denoising Fu Jin, Paul Fieguth, Lowell Winger
572
Alpha-Stable Noise Reduction in Video Sequences Mohammed El Hassouni, Hocine Cherifi
580
Automatic Text Extraction in Digital Video Based on Motion Analysis Duarte Palma, João Ascenso, Fernando Pereira
588
Fast Video Registration Method for Video Quality Assessment Jihwan Choe, Chulhee Lee
597
Hidden Markov Model Based Events Detection in Soccer Video Guoying Jin, Linmi Tao, Guangyou Xu
605
3D Imaging Improving Height Recovery from a Single Image of a Face Using Local Shape Indicators Mario Castelán, Edwin R. Hancock Recovery of Surface Height from Diffuse Polarisation Gary Atkinson, Edwin Hancock
613 621
Vectorization-Free Reconstruction of 3D CAD Models from Paper Drawings Frank Ditrich, Herbert Suesse, Klaus Voss
629
Plane Segmentation from Two Views in Reciprocal-Polar Image Space Zezhi Chen, Nick E. Pears, Bojian Liang, John McDermid
638
Tracking of Points in a Calibrated and Noisy Image Sequence Domingo Mery, Felipe Ochoa, René Vidal
647
Multiresolution Approach to “Visual Pattern” Partitioning of 3D Images Raquel Dosil, Xosé R. Fdez-Vidal, Xosé M. Pardo
655
Visual Cortex Frontend: Integrating Lines, Edges, Keypoints, and Disparity João Rodrigues, J.M. Hans du Buf
664
Estimation of Directional and Ambient Illumination Parameters by Means of a Calibration Object Alberto Ortiz, Gabriel Oliver
672
Environment Authentication Through 3D Structural Analysis Toby P. Breckon, Robert B. Fisher
680
Camera Calibration Using Two Concentric Circles Francisco Abad, Emilio Camahort, Roberto Vivó
688
Three-Dimensional Object Recognition Using a Modified Exoskeleton and Extended Hausdorff Distance Matching Algorithm Rajalida Lipikorn, Akinobu Shimizu, Hidefumi Kobatake
697
Recognition of 3D Object from One Image Based on Projective and Permutative Invariants J.M. González, J.M. Sebastián, D. García, F. Sánchez, L. Angel
705
Wide Baseline Stereo Matching by Corner-Edge-Regions Jun Xie, Hung Tat Tsui
713
Gradient Based Dense Stereo Matching Tomasz Twardowski, Boguslaw Cyganek, Jan Borgosz
721
Image Retrieval and Indexing Accelerating Multimedia Search by Visual Features Grzegorz Galinski, Karol Wnukowicz, Wladyslaw Skarbek
729
Semantic Browsing and Retrieval in Image Libraries Andrea Kutics, Akihiko Nakagawa
737
Robust Shape Retrieval Using Maximum Likelihood Theory Naif Alajlan, Paul Fieguth, Mohamed Kamel
745
A Novel Shape Feature for Image Classification and Retrieval Rami Rautkorpi, Jukka Iivarinen
753
A Local Structure Matching Approach for Large Image Database Retrieval Yanling Chi, Maylor K.H. Leung
761
People Action Recognition in Image Sequences Using a 3D Articulated Object. Jean-Charles Atine
769
CVPIC Compressed Domain Image Retrieval by Colour and Shape Gerald Schaefer, Simon Lieutaud
778
Automating GIS Image Retrieval Based on MCM Adel Hafiane, Bertrand Zavidovique
787
Significant Perceptual Regions by Active-Nets David García-Pérez, Antonio Mosquera, Marcos Ortega, Manuel G. Penedo
795
Improving the Boosted Correlogram Nicholas R. Howe, Amanda Ricketson
803
Distance Map Retrieval László Czúni, Gergely Császár
811
Grass Field Segmentation, the First Step Toward Player Tracking, Deep Compression, and Content Based Football Image Retrieval Kaveh Kangarloo, Ehsanollah Kabir
818
Spatio-temporal Primitive Extraction Using Hermite and Laguerre Filters for Early Vision Video Indexing Carlos Joel Rivero-Moreno, Stéphane Bres
825
Non-parametric Performance Comparison in Pictorial Query by Content Systems Sergio Domínguez
833
Morphology Hierarchical Watersheds with Inter-pixel Boundaries Luc Brun, Philippe Vautrot, Fernand Meyer
840
From Min Tree to Watershed Lake Tree: Theory and Implementation Xiaoqiang Huang, Mark Fisher, Yanong Zhu
848
From Min Tree to Watershed Lake Tree: Evaluation Xiaoqiang Huang, Mark Fisher
858
Optimizing Texture Primitives Description Based on Variography and Mathematical Morphology Assia Kourgli, Aichouche Belhadj-aissa, Lynda Bouchemakh
866
Author Index
875
An Automated Multichannel Procedure for cDNA Microarray Image Processing
Rastislav Lukac1, Konstantinos N. Plataniotis1, Bogdan Smolka2*, and Anastasios N. Venetsanopoulos1
1 The Edward S. Rogers Sr. Dept. of Electrical and Computer Engineering, University of Toronto, 10 King’s College Road, Toronto, M5S 3G4, Canada
{lukacr, kostas, anv}@dsp.utoronto.ca
2 Polish-Japanese Institute of Information Technology, Koszykowa 86 Str, 02-008 Warsaw, Poland
[email protected]
Abstract. In this paper, an automated multichannel procedure capable of processing cDNA microarray images is presented. Using a cascade of nonlinear filtering solutions based on robust order-statistics, the procedure removes both background and high-frequency corrupting noise, and correctly identifies edges and spots in cDNA microarray data. Since the method yields excellent performance by removing noise and enhancing spot location determination, the proposed set of cascade operations constitutes the perfect tool for subsequent microarray analysis and gene expression tasks.
1
Introduction
Microarray imaging technology [1] is used to effectively analyze changes caused by carcinogens and reproductive toxins in genome-wide patterns of gene expression in different populations of cells. Using a two-color (Cy3/Cy5) system, Complementary Deoxyribonucleic Acid (cDNA) microarrays are formed as two-channel, Red-Green images (Fig. 1). The Red (R) color band is used to indicate particular genes expressed as spots in the experimental (Cy5) channel, while the Green (G) portion of the signal denotes the spots corresponding to the control (Cy3) channel [8]. Yellow spots indicate the coincidence of genetic sequences. The spots occupy a small fraction of the image area and they should be individually located and isolated from the image background prior to the estimation of their mean intensities. The large number of spots, usually in the thousands, and their shape and position irregularities necessitate the use of a fully automated procedure to accomplish the task [1],[5]. Variations in the image background, and in the spot sizes and positions, represent the major sources of uncertainty in spot finding and gene expression determination [7],[11]. Noise contamination, mostly in the form of photon noise, electronic noise, laser light reflection and dust on the glass slide, results in the introduction of a substantial noise floor in the microarray image. Since automated spot localization tools can mistakenly declare bright artifacts as spots, image filtering prior to subsequent analysis is necessary [11]. The removal of noise in cDNA microarray images makes spot detection and analysis easier, and results in accurate gene expression measurements that can be readily interpreted and analyzed [12].

* This research has been supported by a grant No. PJ/B/01/2004 from the Polish-Japanese Institute of Information Technology.

Fig. 1. cDNA microarray image: (a) RG image, (b) decomposed R channel, (c) decomposed G channel.
2
Fundamentals of Multichannel Image Processing
Let us consider a two-channel image representing a two-dimensional matrix of two-component samples $x_{(p,q)} = [x_{(p,q)1}, x_{(p,q)2}]$, with $p = 1, 2, \ldots, K$ and $q = 1, 2, \ldots, L$ denoting the image row and column, respectively. As it is shown in Fig. 1, cDNA microarrays can be viewed as two-channel Red-Green (RG) images [8]. The components $x_{(p,q)k}$, for $k = 1, 2$, represent the elements of the vectorial input, with $x_{(p,q)1}$ indicating the R component and $x_{(p,q)2}$ indicating the G component. Thus, each two-channel sample is represented by a two-dimensional vector in the vector space [6]. Each vector is uniquely defined by its length (magnitude) and orientation (direction), the latter corresponding to a point on the unit sphere defined in the vector space [6]. Due to the numerous noise impairments, microarray images suffer from variations in intensity [8],[11]. Since samples deviating significantly from their neighbors usually denote outliers in the data population, the most obvious way to remove atypical samples is to apply a smoothing operator [9]. Using a sliding 3×3 square-shape window $W = \{x_1, x_2, \ldots, x_N\}$ of finite size (here $N = 9$), the filtering procedure replaces the sample placed in the window center through a function applied to the local neighborhood area W. This window operator slides over the entire image, for $p = 1, 2, \ldots, K$ and $q = 1, 2, \ldots, L$, to cover all the pixels in the microarray image [6],[9]. In order to determine outlying samples in the input set W, the differences between any two vectors $x_i$ and $x_j$ should be quantified. The most natural way is to evaluate differences using both magnitude and direction [6]. Using the well-known Euclidean metric, the difference in magnitude between two microarray vectors is defined as:

$d(x_i, x_j) = \|x_i - x_j\|_2 = \sqrt{(x_{i1} - x_{j1})^2 + (x_{i2} - x_{j2})^2}$   (1)

The difference in their directional characteristics can be calculated as:

$A(x_i, x_j) = \arccos\!\left( \frac{x_i \cdot x_j}{\|x_i\|_2 \, \|x_j\|_2} \right)$   (2)
It has been demonstrated in [7] that cDNA microarray images contain vectors which differ mostly in magnitude and that directional processing of the microarray vector data may not be helpful in removing noise impairments. It is also well known that microarray images are nonlinear in nature due to the presence of spots, variations between foreground and background, and numerous noise sources affecting the image formation process. Therefore, the proposed solution is a nonlinear multichannel scheme operating in the magnitude domain of microarray images. It will be shown in the sequel that the scheme preserves important structural elements such as spot edges, and at the same time eliminates microarray image impairments.
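For concreteness, the following short Python/NumPy sketch evaluates the two distance measures of Eqs. (1) and (2) for a pair of two-channel (R, G) samples; the function names and the example pixel values are illustrative only and are not taken from the paper.

```python
import numpy as np

def magnitude_distance(x_i, x_j):
    """Euclidean distance between two two-channel (R, G) samples, Eq. (1)."""
    x_i, x_j = np.asarray(x_i, float), np.asarray(x_j, float)
    return np.sqrt(np.sum((x_i - x_j) ** 2))

def directional_distance(x_i, x_j):
    """Angle (radians) between two two-channel samples, Eq. (2)."""
    x_i, x_j = np.asarray(x_i, float), np.asarray(x_j, float)
    cos_a = np.dot(x_i, x_j) / (np.linalg.norm(x_i) * np.linalg.norm(x_j))
    return np.arccos(np.clip(cos_a, -1.0, 1.0))

# Two microarray pixels that differ mainly in magnitude:
print(magnitude_distance([200, 40], [100, 20]))    # large magnitude difference
print(directional_distance([200, 40], [100, 20]))  # ~0, same R/G direction
```

Pixels that lie along nearly the same R/G direction yield a small directional difference even when their magnitudes differ strongly, which is the situation reported for cDNA data in [7].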
3
Nonlinear Cascade Operations for cDNA Microarray Image Processing
Our method introduced in [7] utilizes nonlinear cascade operations based on robust order-statistic theory. Since the extreme, noise-like observations maximize the aggregated distances to the other inputs located inside the supporting window W, the samples minimizing the distance criterion correspond to robust estimates of the actual, noise-free pixel [3],[6],[10]. Each input sample $x_i$, for $i = 1, 2, \ldots, N$, is associated with a non-negative value $D_i$ equal to the aggregated Euclidean distances among the vectorial inputs [6],[10]:

$D_i = \sum_{j=1}^{N} \|x_i - x_j\|_2$   (3)

Since the aggregated distances $D_i$, for $i = 1, 2, \ldots, N$, are scalar quantities, they can be ordered according to their values, resulting in the ordered set

$D_{(1)} \le D_{(2)} \le \cdots \le D_{(N)}$   (4)

where $D_{(i)}$, for $i = 1, 2, \ldots, N$, denotes the ordered item placed on the $i$-th rank. Assuming that the ordering of the aggregated distances implies the same ordering of the corresponding vectors, the procedure reports an ordered set of vectors

$x_{(1)}, x_{(2)}, \ldots, x_{(N)}$   (5)

where $x_{(i)}$, for $i = 1, 2, \ldots, N$, denotes the $i$-th vector order-statistic [6],[10]. It is evident that the lowest ranked vector $x_{(1)}$ is associated with the minimum aggregated distance $D_{(1)}$ and the uppermost ranked vector $x_{(N)}$ corresponds to the maximum aggregated distance $D_{(N)}$. Since (3) expresses the similarity of the vector $x_i$ to the other vectors inside W, the lowest ranked vector $x_{(1)}$ is the most typical sample for the vectorial set W. On the other hand, due to its maximum dissimilarity, the uppermost ranked vector $x_{(N)}$ usually corresponds to an outlier present in the input set W. To remove high-frequency noise and preserve edge information at the same time, the robust vector median filter (VMF) [3] is employed at the first processing level of the cascade method of [7]. The output of the VMF scheme is the input vector minimizing the distance to all other samples inside the input set W:

$x_{VMF} = \arg\min_{x_i \in W} \sum_{j=1}^{N} \|x_i - x_j\|_2$   (6)
where $\|\cdot\|_2$ is the Euclidean distance. It is not difficult to see that this minimization principle results in the lowest ranked vector order-statistic $x_{(1)}$, equivalently obtained using the operations (4) and (5). Since the ordering can be used to determine the positions of the different input vectors without any prior information regarding the signal distributions, vector order-statistics filters, such as the VMF, are considered to be robust estimators. Moreover, the impulse response of the VMF is zero. This suggests that the VMF excellently suppresses impulsive noise and outliers in vector data populations [3]. However, due to the underlying low-pass filtering concept, the VMF and its variants do not normalize the variations in the background, which prohibits the correct recognition of the spots formed by vectors with low intensity components. To remove background noise, the second processing stage in the proposed method employs both the spectral image characteristics and the minimization concept of (5) in a unique and novel way [7]. Instead of minimizing the differences in magnitude calculated over spatially adjacent vector inputs inside W, we minimize distance measures defined in the color-ratio domain [7]:

$r_{(1)} = \arg\min_{r_i} \sum_{j=1}^{N} |r_i - r_j|$   (7)
where $r_i = x_{i1}/x_{i2}$, for $i = 1, 2, \ldots, N$, denotes the R/G ratio quantity associated with the vectorial input $x_i$. Similarly to (5), the lowest ratio order-statistic $r_{(1)}$ minimizes the aggregated absolute differences to the other input ratios. This produces the ratio value typical for a localized image region. Due to the decreased high-frequency portion of the color-ratio signal, we are able to preserve the structural content of the cDNA image and at the same time to remove low-level variations attributed to the background noise. Since the output value obtained in (7) is defined in the ratio domain, it is mapped to the intensity domain through the normalization operation (8), in which the output ratio is rescaled using the components of a normalization vector $\tilde{x}$ [7]. Here $\tilde{x}$ is a vector whose components are used to normalize the output ratio in order to recover the individual output R and G intensities. Using the order-statistic concept [9], $\tilde{x}$ is defined here as the component-wise median filter (MF) output. It should be emphasized that the MF quantity $\tilde{x}_1$ minimizes the aggregated absolute differences to the R components (scalar values) $x_{i1}$ of the set W, for $i = 1, 2, \ldots, N$. Analogously, the MF value $\tilde{x}_2$ minimizes the aggregated absolute differences to the other G components $x_{i2}$, for $i = 1, 2, \ldots, N$, localized within W. Thus, the nonlinear median operation is used to normalize the output order-statistic obtained in the ratio domain. In the last processing stage of the proposed method, microarray spots are detected using a vector order-statistic based edge operator [7]. This choice is reasonable due to the fact that the spectral correlation which exists between the RG channels of the microarray images necessitates the modelling of edges as discontinuities in vector data [7]. Moreover, vector-based edge detectors which utilize order-statistics are considered immune against residual noise which may be present after the preceding steps, and they can be efficiently implemented in either hardware or software. The so-called vector range (VR) detector is defined as follows [6],[10]:

$VR = \|x_{(N)} - x_{(1)}\|_2$   (9)

where the vectors $x_{(N)}$ and $x_{(1)}$ correspond to the vectors with the maximum and minimum aggregated Euclidean distances inside W. Thus, VR quantitatively expresses the deviation of the values within W. By thresholding VR, the presence of an edge can be determined. Due to the robustness of the cascade operations (6) and (8) used in the prior steps, the VR operator produces excellent results while localizing the microarray spots. Note that by cascading the processing levels (6) and (8) the noise impairments are perfectly removed. The use of (9) does not prohibit the application of segmentation methods or other microarray processing tools such as shape manipulation and grid adjustment schemes. Therefore, the proposed method can be employed in any microarray analysis and gene expression tool or microarray image processing pipeline [7].
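As a rough illustration of how the three stages fit together, the sketch below implements the cascade on a sliding 3×3 window in Python/NumPy. The border handling, the small constant guarding the ratio against division by zero, the way the filtered ratio is mapped back to (R, G) intensities, and the edge threshold are assumptions made for this sketch; they are not taken from [7].

```python
import numpy as np

def _window_vectors(img, p, q):
    """Return the N=9 two-channel samples of the 3x3 window centred at (p, q)."""
    return img[p - 1:p + 2, q - 1:q + 2].reshape(-1, 2).astype(float)

def vector_median(win):
    """Lowest-ranked vector order statistic: minimises aggregated L2 distances (VMF, Eq. (6))."""
    d = np.sqrt(((win[:, None, :] - win[None, :, :]) ** 2).sum(-1))  # pairwise distances
    return win[d.sum(1).argmin()]

def ratio_filter(win, eps=1e-6):
    """Lowest-ranked R/G ratio (Eq. (7)) rescaled by the component-wise median,
    an assumed stand-in for the normalization step (8)."""
    r = win[:, 0] / (win[:, 1] + eps)
    r_best = r[np.abs(r[:, None] - r[None, :]).sum(1).argmin()]
    g_med = np.median(win[:, 1])                 # component-wise median of G
    return np.array([r_best * g_med, g_med])     # assumed mapping back to (R, G)

def vector_range(win):
    """VR edge strength (Eq. (9)): distance between uppermost and lowest ranked vectors."""
    d = np.sqrt(((win[:, None, :] - win[None, :, :]) ** 2).sum(-1)).sum(1)
    return np.linalg.norm(win[d.argmax()] - win[d.argmin()])

def cascade(img, edge_threshold=30.0):
    """Apply VMF, then ratio-domain filtering, then VR spot/edge detection."""
    H, W, _ = img.shape
    out = img.astype(float).copy()
    for stage in (vector_median, ratio_filter):
        src, out = out, out.copy()
        for p in range(1, H - 1):
            for q in range(1, W - 1):
                out[p, q] = stage(_window_vectors(src, p, q))
    edges = np.zeros((H, W), bool)
    for p in range(1, H - 1):
        for q in range(1, W - 1):
            edges[p, q] = vector_range(_window_vectors(out, p, q)) > edge_threshold
    return out, edges

# toy usage on a random two-channel patch
rg = np.random.randint(0, 255, (32, 32, 2))
filtered, spot_edges = cascade(rg)
```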
4
Experimental Results
A variety of microarray images captured using laser microscope scanners have been used to test the performance of the proposed method. Note that the images vary in complexity and noise characteristics. Fig. 2a shows a cDNA microarray input used to facilitate the visual comparison of the images obtained at different stages of the proposed method. Figs. 2a,e show the images obtained when only (9) is used (i.e. the noise removal operations are excluded). It can be seen that noise impairments prohibit the correct spot localization. Figs. 2b,f show that if (6) followed by (9) is used, high-frequency impairments (foreground noise) are eliminated; however, no spots are localized in the image regions affected by background noise. On the other hand, using the processing steps of (8) and (9), the proposed method removes background noise (Fig. 2c). However, this operation does not eliminate foreground noise, which is further amplified (Fig. 2g) by the spot localization procedure of (9). When the proposed method uses the complete set of image processing operations defined via (6), (8) and (9), the procedure excellently removes both foreground and background noise (Fig. 2d) and clearly localizes all regular spots (Fig. 2h). It can be easily observed that outliers, shot noise, fluorescence artifacts and background noise are removed when the complete proposed processing cycle is used. This is also confirmed by the 3-D plots of cDNA microarrays shown in Fig. 3. It is not difficult to see that the use of (6), corresponding to the VMF, does not eliminate variations in the background. Moreover, Fig. 3b shows that, due to the use of the conventional low-pass filter, the spots described by low intensities vanish into the image background. On the other hand, (8), defined in the R/G ratio domain, eliminates background impairments; however, this step does not remove noise spikes (Fig. 3c). Visual inspection of the results depicted in Fig. 3d reveals that the use of the cascade consisting of the operations defined in (6) and (8) produces an output image with ideal, steep spot edges which can be easily detected (Fig. 2h) by completing the task using (9).

Fig. 2. Obtained results: (a,e) acquired (input) microarray image and the corresponding edge map, (b,f) output image and the corresponding edge map obtained using (6) and (9), (c,g) output image and the corresponding edge map obtained using (8) and (9), (d,h) output image and the corresponding edge map obtained using the proposed operations (6), (8) and (9).

Fig. 3. Three-dimensional plots of the results corresponding to the images shown in Fig. 2a-d: (a) input microarray image, (b) output image obtained using (6), (c) output image obtained using (8), (d) output image obtained using (6) followed by (8).
5
Conclusion
A new method for cDNA image processing was introduced. The proposed method uses cascaded nonlinear operations in order to: i) remove foreground noise, ii) eliminate background noise, and iii) localize the spots in microarray images. The method utilizes the spectral correlation characteristics of the microarray image in conjunction with the minimization principle based on order-statistics theory. Employing these concepts, the proposed method excellently removes noise present in the cDNA microarray images and at the same time preserves the structural content of the image. If the noise removal operations (6) and (8) are used prior to the vector edge operator of (9), the procedure clearly localizes regular microarray spots and edge-discontinuities in the microarray images.
References
1. Ajay, N., Tokuyasu, T., Snijders, A., Segraves, R., Albertson, D., and Pinkel, D.: Fully automatic quantification of microarray image data. Genome Research 12 (2002) 325–332
2. Arena, P., Bucolo, M., Fortuna, L., Occhipinty, L.: Cellular neural networks for real-time DNA microarray analysis. IEEE Engineering in Medicine and Biology 21 (2002) 17–25
3. Astola, J., Haavisto, P., Neuvo, Y.: Vector median filters. Proceedings of the IEEE 78 (1990) 678–689
4. Bozinov, D.: Autonomous system for web-based microarray image analysis. IEEE Transactions on Nanobioscience 2 (2003) 215–220
5. Katzer, M., Kummert, F., Sagerer, G.: Methods for automatic microarray image segmentation. IEEE Transactions on Nanobioscience 2 (2003) 202–213
6. Lukac, R., Smolka, B., Martin, K., Plataniotis, K.N., Venetsanopoulos, A.N.: Vector filtering for color imaging. IEEE Signal Processing Magazine - Special Issue on Color Image Processing 21 (2004)
7. Lukac, R., Plataniotis, K.N., Smolka, B., Venetsanopoulos, A.N.: A multichannel order-statistic technique for cDNA microarray image processing. IEEE Transactions on Nanobioscience, submitted (2004)
8. Nagarajan, R.: Intensity-based segmentation of microarray images. IEEE Transactions on Medical Imaging 22 (2003) 882–889
9. Pitas, I., Venetsanopoulos, A.N.: Order statistics in digital image processing. Proceedings of the IEEE 80 (1992) 1892–1919
10. Plataniotis, K.N., Venetsanopoulos, A.N.: Color image processing and applications. Springer Verlag, 2000
11. Wang, X.H., Istepian, R.S.H., Song, Y.H.: Microarray image enhancement using stationary wavelet transform. IEEE Transactions on Nanobioscience 2 (2003) 184–189
12. Zhang, X.Y., Chen, F., Zhang, Y.T., Agner, S.G., Akay, M., Lu, Z.H., Waye, M.M.Y., Tsui, S.K.W.: Signal processing techniques in genomic engineering. Proceedings of the IEEE 90 (2002) 1822–1833
A Modified Nearest Neighbor Method for Image Reconstruction in Fluorescence Microscopy
Koji Yano and Itsuo Kumazawa
Imaging Science and Engineering Laboratory, Tokyo Institute of Technology, Yokohama 226-8503, Japan
[email protected]
Abstract. Fluorescence microscopy is well suited to observing the plane surface of an object, but when it comes to observing the three-dimensional structure of an object, light emissions from structures at different depths are mixed and it is difficult to isolate a section image on a target plane. Various techniques have been proposed to solve this problem, but current techniques require an unaffordable computation cost, and simplified techniques that save this cost, such as the nearest neighbor method, only produce low-quality images. In this paper, we propose a technique that separates the out-of-focus effect coming from planes in front of the target plane from that coming from planes behind it. We evaluated the effectiveness of this technique through experiments and showed improved results under reduced computational cost.
1
Introduction
Fluorescence microscopy is well suited to observing the plane surface of an object, but when it comes to observing the three-dimensional structure of an object, light emissions from structures at different depths are mixed and it is difficult to isolate a section image on a target plane. Various techniques have been proposed to solve this problem. These methods often use multiple images observed with different focus conditions and compute the section image on the objective plane basically by applying deconvolution-based techniques to these multiple images. The application of genuine deconvolution requires an unaffordable computation cost and tends to yield unstable results, as this kind of inverse problem is often very sensitive to noise or errors included in the observed images [4]. In order to solve these problems, the Nearest Neighbor Method (NN Method) and the Expectation Maximization Method (EM Method) were proposed and are currently used in commercial products. The NN Method, which is originally derived from the deconvolution method, is extremely simplified and computes the section image on the target plane using only three images observed at the target plane and its neighboring planes. As a result, reconstructed images tend to be very poor, with remaining effects from other planes and with mis-removed image components lost. The EM Method is faithful to the original deconvolution and gives reasonable image quality, but it still requires an unaffordable amount of computation as it uses multiple images observed at a number of different focus distances. In this paper, we propose a modified NN Method which introduces a new filter for images on the neighboring planes. With this filter, the proposed method separates the effect of light emission from planes in front of the target plane and that from planes behind the target plane. By separating these effects, the proposed method computes the image on the target plane more accurately than the conventional NN Method does. We evaluated the effectiveness of this method through some experiments and showed improved results under reduced computational cost.
2
Formulation of Observation and Reconstruction
An image observed by the fluorescence microscopy focusing on a plane at depth $z$ (depth is specified by the coordinate $z$ throughout this paper) is formulated as

$i_z(x, y) = \sum_{k} o_k(x, y) * psf(x, y, z - k)$   (1)

where the notation $*$ means convolution, $o_k$ is the section image on the plane at depth $k$, and $psf$ is the point spread function determined by the optical characteristics of the fluorescence microscopy. This point spread function takes the form of a three-dimensional function, as emitted light from planes at different depths affects the observed image in the case of the fluorescence microscopy. The section images on the different planes are blurred and mixed according to Eq. (1) and give the observed image $i_z$. The Fourier transforms of $i$ and $o$ are notated as $I$ and $O$, respectively, in the following discussion. According to [7], we model the psf (point spread function) in Eq. (1) by a defocus kernel (Eq. (2)) whose spatial width is a function of the defocus distance; in this paper it is defined simply as being proportional to that distance, with a suitable value of the constant $\alpha$.
2.1
Nearest Neighbor(NN) Method
The NN Method [1] approximates Eq. (1) by using only three images $i_{z-1}$, $i_z$ and $i_{z+1}$, which are observed when the fluorescence microscopy is focused on the target plane (the plane at depth $z$) and its neighboring planes $z-1$ and $z+1$, respectively, and computes the section image on the plane $z$ by

$\hat{O}_z = I_z - \alpha\, S\, I_{z-1} - \alpha\, S\, I_{z+1}$   (3)

where $I_z$ and $S$ mean the two-dimensional Fourier transforms of $i_z$ and of the point spread function with respect to $(x, y)$ for a fixed one-plane defocus, respectively. This method assumes that the most significant effect on $i_z$ comes from its neighbors, which are $i_{z-1}$ and $i_{z+1}$ with the weighting factor $\alpha$, and that the effects from other planes are negligible [1]. The optimum value for $\alpha$ is shown to be 0.49 [2],[3]. The computation is so simplified in the NN Method that an accurate reconstruction is not expected. However, the amount of computation is dramatically reduced, and even results for high-resolution images are obtained in a couple of seconds with the additional acceleration by the FFT.
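The simplicity of the NN computation can be seen from the following Python sketch; a Gaussian blur stands in for the one-plane defocus psf, and the clipping to non-negative values and the omission of any final rescaling of the in-focus term are choices of the sketch rather than of the method as published.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def nn_restore(i_prev, i_target, i_next, alpha=0.49, sigma=2.0):
    """Nearest-neighbor estimate of the section at the target plane:
    subtract blurred versions of the two neighboring observations from
    the observed target-plane image (cf. Eq. (3))."""
    blurred_neighbors = (gaussian_filter(i_prev.astype(float), sigma)
                         + gaussian_filter(i_next.astype(float), sigma))
    return np.clip(i_target.astype(float) - alpha * blurred_neighbors, 0, None)
```

Because the operation is a handful of filtered subtractions per plane, its cost is a small constant multiple of the image size, which is why high-resolution results are obtained in seconds.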
2.2
EM Method
As the EM Method is not directly related to our method, we give only a rough outline of it here. According to [5],[6], we can compute Eq. (1) by the EM Method through the iterative update of Eq. (4), in which the current estimate is re-blurred with the psf, compared with the observed images, and corrected multiplicatively. This method is known to give the most acceptable results among existing algorithms [4], with relatively low sensitivity to noise. However, it requires an unaffordable computation cost, taking a couple of days to complete a three-dimensional reconstruction in some cases.
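For reference, the sketch below shows the standard Richardson-Lucy form of the EM iteration for this kind of deconvolution problem; the exact update used in [5],[6] may differ in details, and the flat initial estimate and iteration count here are arbitrary choices of the sketch.

```python
import numpy as np
from scipy.signal import fftconvolve

def em_deconvolve(observed, psf, n_iter=50, eps=1e-12):
    """Richardson-Lucy / EM iteration for a 3-D focus stack `observed` and a
    3-D psf: re-blur the estimate, compare with the data, and apply the
    correction with the mirrored psf."""
    psf = psf / psf.sum()
    psf_mirror = psf[::-1, ::-1, ::-1]
    estimate = np.full_like(observed, observed.mean(), dtype=float)
    for _ in range(n_iter):
        reblurred = fftconvolve(estimate, psf, mode="same")
        ratio = observed / (reblurred + eps)
        estimate *= fftconvolve(ratio, psf_mirror, mode="same")
    return estimate
```

Each iteration requires two full 3-D convolutions over the whole stack, which is the source of the large running times mentioned above.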
3
Problems in NN Method
Fig. 1(a) illustrates the three section images of the target object. The observed images by the fluorescence microscopy focusing on the planes corresponding to these section images are illustrated in Fig. 1(b).
Fig. 1. Section images and observed images
Using these figures, we give a consideration of the factors that degrade the reconstructed images when the NN Method is applied. To simplify the explanation, we use 1 as the value of $\alpha$ in Eq. (3). Then the part of Eq. (3)

$I_z - \alpha\, S\, I_{z-1}$   (5)

is expected to remove the gray triangle in Fig. 1(b) through the cancellation effect of this subtraction (Eq. (5)). However, if we further execute the additional subtraction (the third term in Eq. (3)), such as

$(I_z - \alpha\, S\, I_{z-1}) - \alpha\, S\, I_{z+1}$
then the gray triangle once removed by (5) is subtracted again. This problem of double subtraction seems to be solved if we use 0.5 as the value for $\alpha$, but it still causes errors inside the overlapped area of the triangle region and the circle region in Fig. 1(b). As the blur happens to a different degree in these three triangles, this error in the overlapped area is inevitable and degrades the reconstruction result. The optimum value 0.49 is found just as a compromise between these problems. In addition, another factor of errors exists along the boundary of the regions in Fig. 1(b). Let us consider this problem using Eq. (5) again. The circular region is observed without blur when focusing on its own plane. On the other hand, the circle region is observed with blur when focusing on a neighboring plane. If we execute the subtraction of Eq. (5) under this circumstance, the blurred boundary of the circle region is subtracted from the un-blurred boundary of the circle region and it causes errors. These errors, however, have an effect similar to edge enhancement and sometimes sharpen the images unintentionally, but they should be reduced to know the true section images.
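A small one-dimensional toy computation makes the double-subtraction effect visible; the plane-to-plane blur model (Gaussians of fixed width) and all numbers below are invented purely for illustration.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

# 1-D toy: a single bright structure lives on plane z-1 only.
o_prev = np.zeros(64); o_prev[20:28] = 100.0     # the "gray triangle"
o_target = np.zeros(64); o_target[40:44] = 80.0  # true content of the target plane
o_next = np.zeros(64)

blur = lambda x, s: gaussian_filter(x, s)
# observed images: every plane collects blurred light from the other planes
i_prev   = o_prev + blur(o_target, 2)
i_target = o_target + blur(o_prev, 2) + blur(o_next, 2)
i_next   = o_next + blur(o_prev, 4) + blur(o_target, 2)

alpha = 1.0
partial = i_target - alpha * blur(i_prev, 2)   # Eq. (5)-like step: triangle removed once
full    = partial - alpha * blur(i_next, 2)    # third term: triangle subtracted again
print(partial[20:28].round(1))  # near zero  -> correct cancellation
print(full[20:28].round(1))     # strongly negative -> double subtraction
```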
4 Proposed Methods
To reduce the errors mentioned in the previous section, we modify the Nearest Neighbor (NN) method. We call this modified method Proposed-Method-I and later compare the performance of the NN method and Proposed-Method-I by some experiments. We also applied the EM method to the image reconstructed by Proposed-Method-I and examined how the image quality and the computation time are improved by using this image as an initial image for the iteration procedure of the EM method. We call the method that combines Proposed-Method-I and the EM method Proposed-Method-II.
4.1 Proposed-Method-I
In this method, we define two images by the following equations:
The first image includes only the effect of light emitted from the planes in front of the target plane; in other words, it is obtained when we remove from the observed image the blur component caused by light emitted from the remaining planes. In the same way, the second image includes only the effect of light emitted from the planes behind the target plane; it is likewise obtained by removing from the observed image the blur component caused by light emitted from the remaining planes. Fig. 2(c) illustrates this situation.
Fig. 2. Section images, observed images, and definitions of the two decomposed images.
To solve the problems of the NN method mentioned in the previous section, we use the two images defined by Eq. (7) as substitutes for the neighboring observed images in Eq. (3). As a result, the reconstructed image Ô given by Proposed-Method-I is formulated as follows.
In this equation, the substituted images do not include any components of the target plane, and they do not include any common components. Therefore, the following problems of the NN method no longer occur: (1) double subtraction of the same component, and (2) unusual enhancement of edges.
4.2 Proposed-Method-II
To improve the performance of the EM method, we use the result of Proposed-Method-I, that is, ô computed by Eq. (9), as the initial image for the iteration procedure of the EM method. After obtaining the images reconstructed by Proposed-Method-I for every plane, we use this three-dimensional image as the initial image:
and apply the procedure shown in Eq.(4).
5 Computation Procedure
We give the detailed computation procedure for Proposed-Method-I in this section.
5.1 Assumption
In order to compute the two decomposed images using only the observed images, we assume the following relationship for the OTF of the point spread function.
The point spread function defined by Eq. (2) does not satisfy this assumption in a strict sense; however, it is shown later that the assumed relation is a good approximation and the error caused by this difference is sufficiently small. In addition, we assume the following relationship.
With these assumptions, we can compute the two decomposed images as follows.
5.2 Computing the Decomposed Images Using Only the Observed Images
The derivation can be separated into the following two parts.
By using the relations assumed in Eq. (11) and Eq. (12), this can be rewritten as
As the first terms in Eq. (13) and Eq. (14) are the same and the second terms include the quantity defined by Eq. (8), we can obtain the following equation by subtracting Eq. (14) from Eq. (13).
Fig. 3. Images from mosquito’s intestines.
By dividing both sides of the above equation by the corresponding factor, we can obtain the approximation as follows.
In this way, the quantity defined by Eq. (8) is computed using only the observed images, and the other decomposed image is computed in a similar way.

6 Experimental Results
We conducted experiments using actual data offered at http://www.aqi.com/. These data (Fig. 3(a)) are images of mosquito intestines observed by fluorescence microscopy. The observed image and the reconstructed images are shown in Fig. 3. According to these results, Proposed-Method-I and Proposed-Method-II provided almost as good results as the EM method. While the computational
cost of Proposed-Method-I is almost the same as that of the NN method and much lower than that of the EM method, it reconstructed much better images than the NN method. The computation times required to obtain these results were around 2.5 seconds for the NN method, 2.4 seconds for Proposed-Method-I, 3200 seconds for the EM method and 3350 seconds for Proposed-Method-II.
7 Conclusions
In this paper, we proposed a modified nearest neighbor method (Proposed-Method-I) which reduces the errors caused by the mixture of light emitted from unfocused planes and reconstructs an improved section image on the target plane. We also proposed an EM method that uses the image reconstructed by Proposed-Method-I as the initial image for its iteration procedure (Proposed-Method-II). Proposed-Method-I is shown to give improved reconstructed images at a reduced computational cost compared to the conventional NN method. Proposed-Method-II is shown to give better convergence compared to the standard EM method and reconstructs images with improved quality compared to Proposed-Method-I.
References
1. D.A. Agard: "Optical sectioning microscopy: cellular architecture in three dimensions." Annual Reviews in Biophysics and Bioengineering, vol. 13, pp. 191-219, 1984.
2. Randy Hudson, John N. Aarsvold, Chin-Tu Chen, Jie Chen, Peter Davies, Terry Disz, Ian Foster, Melvin Griem, Man K. Kwong, Biquan Lin: "An Optical Microscopy System for 3D Dynamic Imaging."
3. Jie Chen, John Aarsvold, Chin-Tu Chen: "High-Performance Image Analysis and Visualization for Three-dimensional Light Microscopy," 1997.
4. Geert M.P. van Kempen: "Image Restoration in Fluorescence Microscopy," 1999.
5. Timothy J. Holmes: "Light Microscopic Images Reconstructed by Maximum Likelihood Deconvolution," Handbook of Biological Confocal Microscopy, Plenum Press, New York, pp. 386-402, 1995.
6. Jose-Angel Conchello, James G. McNally: "Fast regularization technique for expectation maximization algorithm for optical sectioning microscopy," SPIE 2655, 199-208, 1996.
7. Kubota and Aizawa: "Reconstruction of images with arbitrary degree of blur by using two images of different focus conditions," Journal of IEICE (in Japanese), Vol. J83-D-II, No. 12, Dec. 2000.
8. "AutoDeblur," http://www.aqi.com/.
An Improved Clustering-Based Approach for DNA Microarray Image Segmentation

Luis Rueda and Li Qin

School of Computer Science, University of Windsor, 401 Sunset Ave., Windsor, ON N9B 3P4, Canada
{lrueda,qin1}@uwindsor.ca
Abstract. DNA Microarrays are powerful techniques that are used to analyze the expression of DNA in organisms after performing experiments. One of the key issues in the experimental approaches that utilize microarrays is to extract quantitative information from the spots, which represent the genes in the experiments. In this process, separating the background from the foreground is a fundamental problem in DNA microarray data analysis. In this paper, we present an optimized clustering-based microarray image segmentation approach. As opposed to traditional clustering-based methods, we use more than one feature to represent the pixels. The experiments show that our algorithm performs microarray image segmentation more accurately than the previous clustering-based microarray image segmentation methods, and does not need a post-processing stage to eliminate the noisy pixels.
1 Introduction

Microarray technology has been recently introduced and provides solutions to a wide range of problems in medicine, health and environment, drug development, etc. Microarrays make use of the sequence resources created by current genome projects and other sequencing efforts to identify the genes which are expressed in a particular cell type or an organism [3]. Measuring gene expression levels under variable conditions provides biologists with a better understanding of gene functions, and has wide applications in the life sciences. As DNA microarray technology emerges, it provides simple, yet efficient tools for experimental explorations of genomic structures, gene expression programs, gene functions, and cell and organism biology. It is widely believed that gene expression data contain information that allows us to understand higher-order structures of organisms and their behavior. Besides their scientific significance, gene expression data have important applications in pharmaceutical and clinical research [11]. A DNA microarray (or microchip) is a glass slide, in which DNA molecules are attached at fixed locations, which are called spots, each related to a single gene.
Microarrays exploit the theory of preferential binding of complementary singlestranded DNA sequences (cDNA, for short), i.e. complementary single stranded DNA sequences tend to attract each other and the longer the complementary parts, the stronger the attraction [2]. While multi-channel microchips are currently being devised, today’s microarray experiments are used to compare gene-expression from two samples, one called target (or experimental) and the other called control. The two samples are labeled by synthesizing single stranded cDNAs that are complementary to the extracted mRNA. In this paper, we introduce an optimized microarray image segmentation algorithm, which, as opposed to the traditional methods, utilizes more than one feature. Our method has been shown to be more accurate than the previous clustering-based approaches, and does not need a post-processing stage for noise removal.
2 Problem Formulation

In general, the analysis of DNA microarray gene expression data involves many steps. The first steps consist of extracting gene expression data from the microarray image, and include spot localization (or gridding), foreground and background separation (image segmentation), and normalization. The first stages are quite important, since the accuracy of the resulting data is essential in posterior analyses. The second step is gene expression data analysis. After the ratios of the intensities are obtained, various methods can be applied to cluster the genes into different function groups based on the ratios retrieved in the first step. In this paper, we deal with the problem of microarray image segmentation. In general, segmentation of an image refers to the process of partitioning the image into several regions, each having its own properties [12]. In microarray image processing, segmentation refers to the classification of pixels as either the signal or the surrounding area, i.e. foreground or background. As a result of partitioning, the foreground pixels fall into one group, and the background pixels fall into another group. There may exist other types of pixels, such as noisy pixels, which are contaminated pixels produced during the microarray production and scanning process, and should be excluded from either the background or the foreground region during segmentation. Depending on the approaches used to classify the pixels, another possible type of pixels includes the edge pixels surrounding the foreground region. Since the intensities of these pixels fall in between the foreground and the background, including or excluding them may lead to different signal-to-noise ratios. The problem can be stated more formally as follows. Let R be an m-by-n integer-valued matrix that represents the image corresponding to the red channel (Cy3), {R(i,j) | i=1,2,...,m; j=1,2,...,n}, and G be an m-by-n integer-valued matrix that represents the image corresponding to the green channel (Cy5), {G(i,j) | i=1,2,...,m; j=1,2,...,n}. We use R(i,j) to refer to the pixel at row i, column j of an image R. We define I as the image obtained after combining R and G using some arbitrary function f(.,.), i.e. I(i,j) = f(R(i,j), G(i,j)).
Assume we deal with c classes or clusters, each representing one of the c categories of pixel intensities. In general, it is assumed that there are two clusters of interest, which represent foreground and background pixels respectively. In our model, we use a real-valued, d-dimensional feature vector to represent the features that we can extract from a pixel. The problem of image segmentation consists of assigning each pixel of R, G, or I to one of the pre-defined classes. In particular, if we are dealing with the two-class problem, the result of the segmentation method will be a black-and-white or binary image B, {B(i,j) | i=1,2,...,m; j=1,2,...,n, where B(i,j) equals either 0 or 255}. After a class label is assigned to every pixel in the image, the foreground and background intensities can be computed using many different statistical measures for the two sets.
3 Existing Microarray Image Segmentation Approaches

To deal with the microarray image segmentation problem, many approaches have been proposed. While we briefly describe below the most widely used techniques in this direction, a comprehensive survey can be found in [10]. Fixed circle segmentation is a traditional technique that was first used in ScanAlyze [8]. This method assigns the same size (diameter) and shape (circle) to all the spots. GenePix [1] and ScanArray Express [9] also provide the option for the fixed circle method. Another method that was proposed to avoid the drawback of the fixed circle segmentation is the adaptive circle segmentation technique. This method considers the shape of each spot as a circle, where the center and diameter of the circle are estimated for each spot. An implementation of this approach can be found in GenePix, ScanAlyze, ScanArray Express, Imagene, and Dapple [4]. Adaptive circle segmentation involves two steps. First, the center of each spot needs to be estimated. Second, the diameter of the circle has to be adjusted. Since the two above-mentioned methods are limited to circular spots, other techniques that deal with "free-shape" spot segmentation have been introduced. One of these methods is seeded region growing (SRG). This method has been successfully applied to image segmentation in general, and has recently been introduced in microarray image processing. In this method, the foreground seed is chosen as the center of the horizontal and vertical grid lines. The background seed is chosen as the point in which the grid lines intersect. After obtaining the seeds, the process is repeated simultaneously for both foreground and background regions until all the pixels are assigned to either foreground or background [14]. Another technique that has been successfully used in microarray image segmentation is the histogram-based approach. Using histograms to classify a pixel into either foreground or background is a simple and intuitive idea. Chen et al. introduced a method that uses a circular target mask to cover all the foreground pixels, and computes a threshold using the Mann-Whitney test [7]. If the pixel intensity is greater than a certain threshold, it is assigned to the foreground region; otherwise it is assigned to the background region.
Another technique that has been efficiently used in microarray image segmentation is clustering, which shows some advantages when applied to microarray images, since it is not restricted to a particular shape or size of the spots. It can be seen as a generalization of the histogram-based approach. Although a clustering method has recently been proposed in microarray image analysis [13], no commercial microarray processing software has adopted this method yet. In this paper, we propose an optimized clustering-based method for microarray image segmentation. We study the use of a multi-dimensional feature space to represent the characteristics of the pixels in a spot, and the effect of applying different clustering approaches. Wu et al. used a k-means clustering algorithm in microarray image segmentation [13], which we refer to as single-feature k-means clustering microarray image segmentation (SKMIS). They attempt to cluster the pixels into two groups, one for foreground and the other for background. Thus, in SKMIS, the feature vector is reduced to a single variable in the one-dimensional Euclidean space. The first step of SKMIS consists of initializing the class label for each pixel and calculating the mean for each cluster. Let the minimum and maximum intensity values in the spot be given. Depending on whether its intensity is closer to the maximum or to the minimum, a pixel is initially assigned to the foreground, or equivalently its label is set to '1', or to the background, in which case it is labeled '2'. After this process, the mean (or centroid) for each class, foreground or background, is calculated as follows:
Although this method requires initialization and an iterative process, it is quite efficient in practice. After the initialization, the second step of the algorithm is the recalculation of the means and the adjustment of the label of each pixel according to the following criterion: assign label '1' to a pixel if its intensity is closer to the foreground mean;
otherwise, assign label '2'. This step is repeated until no change in the means has been observed.
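A minimal sketch of this single-feature two-class k-means is given below; the initialisation rule (splitting the intensity range at its midpoint) and the convergence test are illustrative choices, since the exact formulas appear only in the equations omitted above.

```python
import numpy as np

def skmis(spot):
    """Single-feature k-means segmentation of one spot image.

    spot: 2-D array of combined pixel intensities.  Returns a boolean
    foreground mask.
    """
    x = spot.astype(float).ravel()
    # Initial labels: pixels above the mid-point of the intensity range are
    # provisionally foreground (label 1), the rest background (label 2).
    # (Assumed initialisation; the exact rule is given in [13].)
    labels = np.where(x > (x.min() + x.max()) / 2.0, 1, 2)
    while True:
        m_fg = x[labels == 1].mean()
        m_bg = x[labels == 2].mean()
        # Reassign each pixel to the nearest centroid.
        new_labels = np.where(np.abs(x - m_fg) < np.abs(x - m_bg), 1, 2)
        if np.array_equal(new_labels, labels):   # means no longer change
            break
        labels = new_labels
    return (labels == 1).reshape(spot.shape)
```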
4 Optimized Clustering-Based Microarray Segmentation

Traditional image processing algorithms have been developed based on the information of the intensity of the pixel only. In the microarray image segmentation problem, we found that the position of the pixel, for example, could also influence the result of the clustering, and subsequently that of the segmentation. This kind of feature has also been successfully applied, for example, to the segmentation of natural images in [6].
In our analysis, we also consider the shape of the spot: pixels whose distance to the center of the spot is smaller are more likely to be foreground pixels. We can thus take this spatial information about the pixels into account and construct different features. For example, we can take the Manhattan distance as one of the features, i.e. the distance from the pixel to the center of the spot in the x-axis direction and in the y-axis direction. Alternatively, we can take the Euclidean distance from the pixel to the spot center as a feature. In this case, the spot center refers to the weighted mean of the pixel coordinates using the intensity as the weight; the coordinates of the spot center are computed accordingly, with the intensity of each pixel weighting its coordinates. When we consider the fact that a pixel most of whose surrounding pixels belong to the same cluster is likely to belong to that group, we take into account the mean of the surrounding pixels within a certain distance and the variance of the intensities. By adjusting the size of the surrounding region, we obtain different values of the mean and variance for the pixel as its features. The following equation is used to calculate the distance from the pixel to the spot center.
In our model, the feature vector combines the pixel intensity and the distance from the pixel to the spot center c, as obtained in (3). We call the k-means algorithm that uses this two-dimensional feature vector optimized k-means microarray image segmentation (OKMIS).
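The sketch below shows one way the OKMIS feature vectors could be assembled, assuming (as suggested by the conclusions) that the intensity feature is the sum of the square roots of the two channel intensities and that the second feature is the Euclidean distance to the intensity-weighted spot centre; the min-max scaling of the features is an additional assumption.

```python
import numpy as np

def okmis_features(R, G):
    """Build the two-dimensional OKMIS-style feature vectors for one spot.

    R, G: 2-D arrays with the red (Cy3) and green (Cy5) channel intensities
    of the spot.  Returns an (n_pixels, 2) feature matrix.
    """
    intensity = np.sqrt(R.astype(float)) + np.sqrt(G.astype(float))
    rows, cols = np.indices(R.shape)
    # Intensity-weighted spot centre (weighted mean of the coordinates).
    w = intensity / intensity.sum()
    cy, cx = (w * rows).sum(), (w * cols).sum()
    # Euclidean distance of every pixel to the spot centre, as in (3).
    dist = np.sqrt((rows - cy) ** 2 + (cols - cx) ** 2)

    # Simple min-max scaling so that both features have comparable ranges
    # (an assumption; the paper stresses that normalisation matters).
    def scale(a):
        return (a - a.min()) / (a.max() - a.min() + 1e-12)

    return np.column_stack([scale(intensity).ravel(), scale(dist).ravel()])
```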
5 Experiments on Benchmark Microarray Data

In order to compare OKMIS and SKMIS, where the latter uses the intensity only, we ran both methods on the 1230c1G/R microarray image obtained from the ApoA1 data [5]. For most of the spots, the noisy pixels are excluded from the foreground when considering the information of pixel coordinates. More importantly, for some spots, using the intensity only leads to poor results. This case is shown in Fig. 1, in which it is clear that the two features obtained from the intensity and distance from the center can retrieve the true foreground from the background of the spots. The complete subgrid of the microarray image, which results from both SKMIS and OKMIS, can be found in [10].
Fig. 1. The result of applying SKMIS and OKMIS to spots No. 136 and 137, extracted from the 1230c1G/R microarray image. It is clear that using intensity and distance from the center as the features can reveal the true foreground for these spots.
To obtain a more consistent assessment of our segmentation methods, and their comparison with other approaches, we performed some simulations on benchmark microarray images obtained from the ApoA1 data [5]. First of all, we compare the resulting binary images of the two clustering methods to the original microarray image. A few spots from that image are shown in Fig. 2. We observe that in general OKMIS achieves better results. In some cases (Spot No. 10), OKMIS reveals the true foreground region while SKMIS finds only the noisy pixels. In some cases (Spot No. 29), OKMIS can result in a foreground region that contains fewer noisy pixels. In other cases (Spots No. 11, 12, 22), both OKMIS and SKMIS can obtain a reasonable foreground region, where OKMIS generates a region that is closer in size to the real spot. As can be seen in the figure, OKMIS automatically removed most of the noisy pixels, and it is more efficient than SKMIS, because the latter must perform an additional noise-removal procedure.
Fig. 2. Comparison of SKMIS and OKMIS on some typical spots obtained from the 1230c1G/R microarray image. For spots with high intensity noisy pixels, such as No.10, OKMIS can reveal the true spot foreground instead of the noise produced by SKMIS.
After visually demonstrating that OKMIS generates better results than the SKMIS method, we now provide an objective measurement for a batch of real-life microarray
images. Since SKMIS generates foregrounds with a significant number of noisy pixels, in our experiments we compare the size of the resulting foreground region for both methods. The results are shown in Table 1. The first column of each method contains the total foreground intensity of the green channel, and the second column represents the number of pixels in the foreground region. From the first two columns, we note that the foreground region generated by SKMIS contains many noisy pixels. Thus, a post-processing method has to be applied in order to eliminate the noise. We have applied such a post-processing step, and found that OKMIS eliminates most of the noisy pixels; details of these experiments can be found in [10].
In addition to a nearly noise-free foreground, OKMIS generates a larger foreground region, as we observe in Fig. 2, which is closer to the real spot foreground. Comparing OKMIS with SKMIS, the resulting foreground regions produced by the former are larger than those of the latter for all the images except the first pair. Thus, in most cases, OKMIS produces much better results than SKMIS, even after the foreground correction process.
6 Conclusions

We proposed a new microarray image segmentation method based on a clustering algorithm, which we call OKMIS. Its feature space has two features: the sum of the square roots of the intensities, and the distance from the pixel coordinates to the true spot center. As shown in the experiments, our method performs microarray image segmentation more accurately than the previous clustering-based approach, SKMIS, and does not need a post-processing stage to eliminate the noisy pixels. The proposed algorithm, which generates quite satisfactory results, still has room for improvement. More elaborate feature extraction and normalization schemes can improve the accuracy of a clustering algorithm. When considering more than one feature, the normalization process is very important, not only by scaling, but also by analyzing the correlation between each pair of features. In this regard, principal
component analysis (PCA) is a widely used method that could be applied to produce even better results. This problem constitutes a possible avenue for future research. An open problem that we are currently investigating is the use of more than two clusters. In this case, an extra step is needed to determine which clusters correspond to foreground and which ones belong to background. Although this is not an easy task, the refined classification may lead to more significant results. An automatic clustering algorithm is desirable to evaluate the best number of clusters for each spot and to classify the pixels into foreground and background. Acknowledgments. The authors' research work has been partially supported by NSERC, the Natural Sciences and Engineering Research Council of Canada.
References 1. Axon Instruments, Inc. GenePix 4000A: User’s manual. (1999). 2. Brazma A, and Vilo J.: Gene expression data analysis. FEBS Letters (2000) 480:17-24. 3. Brown, P., Botstein, D.: Exploring the new world of the genome with DNA microarrays. Nat Genet (1999) Jan; 21(1 Suppl):33-37. 4. Buhler, J., Ideker, T., and Haynor, D.: Dapple: Improved Techniques for Finding Spots on DNA Microarrays. Technical Report UWTR 2000-08-05, University of Washington (2000). 5. Callow, M.J., Dudoit, S., Gong, E.L., Speed, T.P., and Rubin, E.M.: Microarray expression profiling identifies genes with altered expression in HDL deficient mice. Genome Research, (2000) Vol. 10, No. 12, 2022-2029. 6. Carson, C., Belongie, S., Greenspan, H., Malik, J.: Blobworld -- Image segmentation using expectation maximization and its application to image querying. IEEE Transactions on Pattern Analysis and Machine Intelligence (2002) 24(8): 1026—1038. 7. Chen, Y., Dougherty, E., and Bittner, M.: Ratio-based decisions and the quantitative analysis of cDNA microarray images. Journal of Biomedical Optics, (1997) 2:364-374. 8. Eisen, M.: ScanAlyze User Manual. (1999). 9. GSI Lumonics. QuantArray Analysis Software, Operator’s Manual. (1999). 10. Qin, L.: New Machine-learning-based Techniques for DNA Microarray Image Segmentation. Master’s thesis, School of Computer Science, University of Windsor, (2004). 11. Schena, M.: Microarray analysis. Published by John Wiley & Sons, Inc., isbm 0-4741443-3, (2002). 12. Soille, P.: Morphological Image Analysis: Principles and Applications. Springer (1999). 13. Wu, H., and Yan, H.: Microarray Image Processing Based on Clustering and Morphological Analysis. Proc. of First Asia Pacific Bioinformatics Conference, Adelaide, Australia, (2003)111-118. 14. Yang, Y., Buckley, M., Dudoit, S., and Speed, T.: Comparison of Methods for Image Analysis on cDNA Microarray Data. Journal of Computational and Graphical Statistics, (2002)11:108-136.
A Spatially Adaptive Filter Reducing Arc Stripe Noise for Sector Scan Medical Ultrasound Imaging

Qianren Xu 1, M. Kamel 1, and M.M.A. Salama 2

1 Dept. of System Design Engineering, University of Waterloo, 200 University Avenue West, Waterloo, Ontario, N2L 3G1, Canada
[email protected], [email protected]

2 Dept. of Electrical and Computer Engineering, University of Waterloo, 200 University Avenue West, Waterloo, Ontario, N2L 3G1, Canada
[email protected]
Abstract. Sector scan medical ultrasound images usually have arc stripes, which do not represent the physical structure and thus are a kind of noise. This paper analyzes the source and characteristics of the arc stripes, and then proposes an adaptive filter based on the geometrical properties of these arc stripes. The proposed filter is the weighted summation of radially adaptive filter and common Gaussian filter. The radially adaptive filter aims to reduce the arc stripe noise. The common Gaussian filter is used to counteract the radial stripe artifact produced by the radial filter and suppress the randomly directional noise as well. The weights of the radially adaptive filter and common Gaussian filter are adapted to the proportion between the arc stripe noise and non-directional noise. The results show that the combined filter obviously enhances the image quality and is superior to common Gaussian filter. Keywords: Arc stripe noise, radial noise reduction, sector scan, ultrasonic image
1 Introduction

Any erroneous information included in images can be considered as image noise. There are many sources of noise in ultrasonic images; for example, one kind of noise arises from the electronic detection system, which can be minimized by a backscattered signal with high amplitude [1]. Another kind of noise is caused by the constructive and destructive coherent interference of backscattered echoes from scatterers smaller than the resolution size, which produces speckle pattern noise [2], [3]. Most research on noise reduction for medical ultrasound imaging focuses on this speckle noise; for example, a series of adaptive filters have been proposed to reduce the speckle noise [4]-[8]. However, a significant kind of noise in sector scan ultrasound images, which appears as a series of arc stripes, has been ignored. These arc stripes do not represent the physical structure of the tissue and thus can be viewed as a kind of noise (see Fig. 1). This arc stripe noise certainly affects tissue visualization and thus decreases the diagnostic power of ultrasound imaging. This paper proposes a spatially adaptive algorithm to filter the arc stripe noise based on the characteristics of
this special noise. This denoising filter will benefit ultrasound image processing such as segmentation, edge detection, classification and 3-D imaging.
Fig. 1. The arc stripe noise on the sector scan ultrasound images.
In the following sections, the source of the arc stripe noise is analyzed in Section 2, the spatially adaptive filter is proposed in Section 3, and experimental examples are presented in Section 4.
2 Analysis of the Source of the Arc Stripe Noise

This arc stripe pattern noise comes from the working mode of the steered sector scanner and the wave propagation properties of the ultrasound beam. The ultrasound beam propagates as a longitudinal wave from the transducer surface into the propagation medium, and exhibits two kinds of beam patterns (Fig. 2(a)): a slightly converging beam out to a certain distance (near-field), and a diverging beam beyond that distance (far-field). For an unfocused single-element transducer, the length of the near-field is determined by the transducer diameter d and the ultrasonic wavelength [9]:
In the far-field, based on Fraunhofer theory, the angle of ultrasound beam divergence is given by
We assume that there are point targets of the same size in the single-element transducer situation (Fig. 2(a)); the lateral size of the image of these point targets will increase beyond the focal zone (Fig. 2(b)). If the original size of a point target is given, the lateral image size s in the far-field at a distance l away from the focal location will be given by the relation in Eq. (3).
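Because the displayed equations are not reproduced here, the following sketch uses the standard textbook relations for an unfocused circular single-element transducer (near-field length d²/(4λ) and far-field divergence sin θ = 1.22 λ/d) as stand-ins for Eqs. (1) and (2), together with a simple geometric estimate of the lateral widening that is only an approximation of Eq. (3).

```python
import math

def near_field_length(d, wavelength):
    """Near-field (Fresnel zone) length of an unfocused circular element."""
    return d ** 2 / (4.0 * wavelength)

def divergence_angle(d, wavelength):
    """Far-field divergence angle (Fraunhofer approximation), in radians."""
    return math.asin(min(1.0, 1.22 * wavelength / d))

def lateral_size(s0, l, theta):
    """Rough lateral image size of a point target at distance l beyond the
    focal location, assuming linear spreading with the divergence angle
    (an illustrative approximation, not the paper's Eq. (3))."""
    return s0 + 2.0 * l * math.tan(theta)

# Example: a 10 mm element at 3.5 MHz in soft tissue (c = 1540 m/s).
wavelength = 1540.0 / 3.5e6                    # ~0.44 mm
print(near_field_length(0.010, wavelength))    # ~0.057 m, i.e. ~5.7 cm
print(math.degrees(divergence_angle(0.010, wavelength)))  # a few degrees
```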
Fig. 2 (c) shows an example of convex sector scan ultrasound imaging that illustrates the source of the arc stripes in the ultrasound imaging. Assuming that there are a
series of point targets arranged as arc arrays, the images of the point targets in focal zone will be the same as the original objects, but the lateral size of the image will increase beyond focal zone. These laterally wider images will be superimposed on the far sides of the focal location, and thus these target points that are originally separated will show as arc stripes in far-field and near-field. The arc stripe noise possesses special geometric properties. First, it is of perfect circular symmetry. Second, the intensity and size of the arc stripes change with the radial depth. Theoretically, there is a location (focal zone) without any arc stripes, and the intensity and size of arc stripes will increase with the distance away from the focal zone.
Fig. 2. The properties of ultrasound beam with the sector scan model and the source of the arc stripes. (a) near-field and far-field for single-element transducer, (b) lateral size of image increases beyond focal zone, (c) arc stripes appear beyond focal zone.
3 Method

The proposed filter is based on the geometrical characteristics of these arc stripes, and it consists of two components: radially adaptive filtering operators and a common Gaussian filtering operator. Several geometric parameters of the sector scan ultrasound image are determined in advance, as shown in Fig. 1(b): the circle center of the arc stripes; the radial depth of the inner edge (near field); the radial depth of the outer edge (far field); the radial depth of the focal location, which is the near-field length determined by (1); the radial depth of any pixel (i, j); and the azimuth angle of any pixel (i, j).
3.1 Radially Adaptive Filtering Operators

Basic Radial Filtering Operators at Special Directions. In order to conveniently describe the proposed filtering method, the simplest 3-by-3 mask is used in the following introduction of the filter structures. We use two filtering masks to reduce arc stripe noise in the horizontal and vertical directions, respectively, and two diagonal filtering masks to reduce arc stripe noise in the two diagonal directions, respectively.

In practice, the coefficients of these filter masks are determined by Eqs. (4) and (5). The Gaussian standard deviation and the mask size k are determined by the noise size of the particular image.
Radial Filtering Operators at an Arbitrary Direction. The filtering operator at any azimuth angle is determined by a soft weighted summation of the neighboring basic radial filtering operators:
3.2 Weighted Summation of the Radial Filtering Operator and the Gaussian Filtering Operator

The radial filter can reduce the arc stripe noise, but there are two limitations: a) it is not able to reduce randomly directional noise effectively, and b) it will produce radial stripe artifacts because it filters only in the radial direction. In order to both counteract the radial stripes produced by the radial filter operator and suppress the non-directional noise, a weighted summation of the radial operator and the Gaussian operator is utilized:
where the directional weights of the basic radial filtering operators are determined by (7), and the weights of the Gaussian filtering operator and of the radial filtering operator are determined by the ratio of the non-directional and arc stripe noise components. The basic radial filtering operators are determined by (4) and (5), and the common Gaussian operator is as follows:
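One possible realisation of this combination is sketched below: anisotropic Gaussian masks play the role of the basic radial operators, the blending between the two nearest basic directions is a simple linear weight, and the exact coefficient formulas of Eqs. (4)-(8) are not reproduced from the paper.

```python
import numpy as np

def directional_gaussian(k, sigma_along, sigma_across, angle):
    """k-by-k anisotropic Gaussian mask elongated along `angle` (radians)."""
    half = k // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    # Rotate coordinates so that u runs along the filtering direction.
    u = x * np.cos(angle) + y * np.sin(angle)
    v = -x * np.sin(angle) + y * np.cos(angle)
    g = np.exp(-(u ** 2 / (2 * sigma_along ** 2) + v ** 2 / (2 * sigma_across ** 2)))
    return g / g.sum()

def combined_mask(k, sigma_r, sigma_g, azimuth, w_gauss, w_radial):
    """Blend of a radial (tangential) mask and an isotropic Gaussian mask.

    The arc stripes run tangentially (perpendicular to the radius), so the
    radial mask is elongated along the tangential direction at `azimuth`.
    The blending between the two nearest basic directions uses a linear
    weight (an assumed form of the 'soft weighted summation').
    """
    basic_angles = np.array([0.0, np.pi / 4, np.pi / 2, 3 * np.pi / 4])
    tangential = (azimuth + np.pi / 2) % np.pi
    idx = np.searchsorted(basic_angles, tangential) % 4
    a0, a1 = basic_angles[idx - 1], basic_angles[idx]
    t = ((tangential - a0) % np.pi) / (np.pi / 4)
    radial_mask = (1 - t) * directional_gaussian(k, sigma_r, 0.5, a0) \
                + t * directional_gaussian(k, sigma_r, 0.5, a1)
    gauss_mask = directional_gaussian(k, sigma_g, sigma_g, 0.0)
    return w_gauss * gauss_mask + w_radial * radial_mask
```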
3.3 Selection of Parameters

We need to determine several parameters before implementing the filtering algorithm: the Gaussian filtering standard deviation, the radial filtering standard deviation, the filtering mask size k, the weight of the Gaussian filtering operator, and the weight of the radial filtering operator.

The Gaussian Standard Deviations. The standard deviation of the Gaussian function is selected in accordance with the size of the noise; it is determined by the noise size s [10].
For the Gaussian filter component, which reduces random noise in general, we can set the standard deviation to a constant based on the average noise size. For the radial filtering component, the standard deviation is adapted to the lateral size s of the noise, and s can be calculated from ultrasonic wave propagation theory as in (3), or simply estimated directly from the ultrasound image. Assuming a linear relation between s and l (the distance between a pixel and the focal location), we can simplify the calculation of the standard deviation for any point (i, j) as
where the Gaussian standard deviations of the radial filtering mask are specified at the outer edge, the inner edge and the focal location, respectively.

The Size k of the Filter Mask. The size k of the filter mask is also selected in accordance with the noise size s, using a rule of thumb [10]:
We can choose an average value of the noise size s to obtain a uniform value of k for all filter operators.

The Weights. The weight of the Gaussian filtering operator and the weight of the radially adaptive filtering operator are determined by the ratio of the non-directional and arc stripe noise components, and they are normalized. They can be determined from the random noise size and the arc stripe noise size.
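The radial adaptation and the weighting can then be organised as in the fragment below; the end-point standard deviations and the two noise-size estimates are placeholders that would be measured from the image, and the proportional weighting rule is an assumed form.

```python
def radial_sigma(l_pixel, l_focal, l_inner, l_outer,
                 sigma_inner, sigma_outer):
    """Radially adaptive standard deviation, linear in the distance to the
    focal depth; at the focal location there are no arc stripes, so the
    radial blur width is (close to) zero there."""
    if l_pixel >= l_focal:                       # towards the outer edge
        frac = (l_pixel - l_focal) / (l_outer - l_focal)
        return frac * sigma_outer
    frac = (l_focal - l_pixel) / (l_focal - l_inner)   # towards the inner edge
    return frac * sigma_inner

def filter_weights(random_noise_size, stripe_noise_size):
    """Normalised weights of the Gaussian and radial components, chosen in
    proportion to the two noise sizes (assumed form of the weighting rule)."""
    total = random_noise_size + stripe_noise_size
    return random_noise_size / total, stripe_noise_size / total
```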
4 Examples

4.1 Testing Image

A testing image (Fig. 3(a)) is used to test the performance of the radially adaptive filtering. The filter parameters were set accordingly, and the size of the mask is 9-by-9. Fig. 3(b) shows that the radial filter can effectively reduce the arc stripes, does not blur radial lines at all in the basic directions (vertical, horizontal and diagonal), and only very slightly blurs radial lines in other directions.
Fig. 3. Testing image result by radial adaptive filter.
4.2 Real-World Medical Ultrasound Images

An example of a transrectal ultrasound prostate image is shown in Fig. 4. Fig. 4(b) and (c) are the masks at the inner and outer edges respectively, which include the radial filtering component and the common Gaussian filtering component. Fig. 4(d) is the filtering mask at the focal zone; there is no arc stripe noise at this location, so there is only a Gaussian filtering component. Fig. 4(e) shows the result of the Gaussian smoothing filter, which reduces noise but blurs the useful image as well. Fig. 4(f) shows the result of the proposed filter, which can effectively suppress both random noise and arc stripe pattern noise. Therefore, the proposed filter has superior performance to the Gaussian filter on the sector scan ultrasound image. Fig. 5 shows another example on a fetus ultrasound image; we can see that the image detail is deblurred after the arc stripe noise is reduced.
Fig. 4. The proposed filtering masks at different radial depths and the results of the proposed filtering and common Gaussian filtering. (a) original transrectal ultrasound image of the prostate, (b) filtering mask at the inner edge, (c) filtering mask at the outer edge, (d) filtering mask at the focal location, (e) the result of Gaussian filtering, (f) the result of the proposed filtering.
5 Conclusion

This paper identifies a significant noise source, the arc stripes in sector scan medical ultrasound images, and summarizes the characteristics of the arc stripe noise. The proposed filtering algorithm deals with the arc stripe noise by utilizing the geometric characteristics of this special noise, and the parameters of the filter are adapted to the radial depth in order to effectively smooth the noise and deblur the useful image detail. The results show that the proposed filter obviously enhances image quality and is superior to the common smoothing filter.
Fig. 5. The result of the proposed filter on fetus ultrasound image
References
1. Webb, A.: Introduction to Biomedical Imaging. Wiley-Interscience, Hoboken, NJ (2003) ch. 3, 133-135
2. Burckhardt, C.B.: "Speckle in ultrasound B-mode scans," IEEE Trans. Son. Ultrason., vol. SU-25 (1978) 1-6
3. Abbott, J.G., Thurstone, F.L.: "Acoustic speckle: Theory and experimental analysis," Ultrason. Imag., vol. 1 (1979) 303-324
4. Huang, H.C., Chen, J.Y., Wang, S.D., Chen, C.M.: "Adaptive ultrasonic speckle reduction based on the slope-facet model," Ultrasound in Med. & Biol., vol. 29 (2003) 1161-1175
5. Karaman, M., Alper, K.M., Bozdagi, G.: "An adaptive speckle suppression filter for medical ultrasound imaging," IEEE Trans. Med. Imaging, vol. 14 (1995) 2832-92
6. Kotropoulos, C.: "Nonlinear ultrasonic image processing based on signal-adaptive filters and self-organizing neural networks," IEEE Trans. Image Process, vol. 3 (1994) 65-77
7. Loupas, T., McDicken, W.N., Allan, P.L.: "An adaptive weighted median filter for speckle suppression in medical ultrasonic images," IEEE Trans. Circ. Sys., vol. 36 (1989) 129-135
8. Chen, Y., Yin, R., Flynn, P., Broschat, S.: "Aggressive region growing for speckle reduction in ultrasound images," Pattern Recognition Letters, vol. 24 (2003) 677-691
9. Bushberg, J.T., Seibert, J.A., Leidholdt Jr., E.M., Boone, J.M.: The Essential Physics of Medical Imaging, 2nd ed. Lippincott, Williams & Wilkins, Philadelphia (2002) ch. 16, 490-500
10. Seul, M., O'Gorman, L., Sammon, M.J.: Practical Algorithms for Image Analysis: Description, Examples, and Code. Cambridge University Press, Cambridge, UK (2000) ch. 3, 68-74
Fuzzy-Snake Segmentation of Anatomical Structures Applied to CT Images Gloria Bueno, Antonio Martínez-Albalá, and Antonio Adán Universidad de Castilla-La Mancha E.T.S.I. Industriales - Avda. Camilo José Cela, 13071 Ciudad Real - E
[email protected]
Abstract. This paper presents a generic strategy to facilitate the segmentation of anatomical structures in medical images. The segmentation is performed using a PDM adapted by fuzzy c-means classification, which also uses the fuzzy decision to evolve the PDM into the final contour. Furthermore, the fuzzy reasoning exploits a priori statistical information from several knowledge sources based on histogram analysis and the intensity values of the structures under consideration. The fuzzy reasoning is also applied and compared to a geometrical active contour model (or level set). The method has been developed to assist clinicians and radiologists in conformal RTP. Experimental results and their quantitative validation to assess the accuracy and efficiency are given for segmenting the bladder in CT images. To assess precision, results are also presented on CT images with added Gaussian noise. The fuzzy-snake is parameter-free and is able to properly segment the structures by using the same initial spline curve for a whole patient study image set.
1 Introduction
Image segmentation is a key problem in many computer vision and medical image processing tasks [1]. The particular interest in this study is the localization of therapy-relevant anatomical structures in 2D CT images, which is still one of the most widely used techniques for radiotherapy treatment planning (RTP) [2]. The localization of these structures is often performed manually, which makes it a tedious task. Automatic delineation has been considered by many authors, who report success for the segmentation of particular imaging modalities and ROIs, but at present there is no universally accepted and proven method [2,3,4]. In this study, the usefulness of fuzzy theory and active contour models (ACM), or snakes, has been investigated to address this problem. ACM were originally presented by Kass et al. [5] and since then they have been widely used. However, traditional snake models have been shown to be limited in several aspects, such as their sensitivity to the initial contours, the fact that they are not parameter-free, and that their equilibrium is not guaranteed. Some techniques have been proposed to solve these drawbacks. These techniques are based on information
fusion, dealing with ACM in addition to region properties [10,11], and on curvature-driven flows [6,7,8,9]. Nevertheless, as far as we are concerned, none of them exploits fuzzy-snake models in conjunction as described here. Our segmentation approach is based on an AC evolving constrained to a fuzzy intensity image. This image is obtained by a previous fuzzy C-means (FCM) clustering algorithm. Moreover, the fuzzy reasoning is also used for final contour convergence. Comparative results of the fuzzy-snake model against a geometrical ACM constrained by the FCM are given. The method may be applicable to a wide class of X-ray CT segmentation tasks for RTP. As a first application, it is tested here for segmenting therapy structures of the pelvic area on 2D CT image data, comprising 77 CT studies from 11 patients being treated for cancer. The next section briefly explains the fuzzy active contour models, both the FCM clustering and the AC models employed (snake and level set). Section 3 is devoted to the presentation of the results on CT images for segmenting a relevant organ of the pelvic area, that is, the bladder. A quantitative evaluation by suitable statistical techniques is also described. Finally, the conclusions are drawn in Section 4.
2 Fuzzy Active Contour Model Segmentation

2.1 Fuzzy C-Means Clustering
Fuzzy set theory provides a powerful mathematical tool for modelling the human ability to reach conclusions when the information is imprecise and incomplete. That is sometimes the case with medical images, which have noise, low contrast densities and therefore ill-defined shapes [1,11]. An unsupervised FCM clustering algorithm is used to cluster the image data into specified classes. This is achieved by computing a measure of membership, called fuzzy membership, at each pixel [12]. The fuzzy membership function, constrained to be between 0 and 1, reflects the degree of similarity between the data value at that location and the centroid of its class. Thus, a high membership value near 1 means that the pixel intensity is close to the centroid for that particular class. The FCM algorithm is then formulated as the minimization of the squared error with respect to the membership functions, U, and the set of centroids:
where each membership value gives the degree of membership of a pixel in a class, and a weighting exponent is applied to each fuzzy membership. The centroids and the membership functions are defined as:
Iterating through these conditions leads to a grouped coordinate descent scheme for minimizing the objective function. The stopping criterion is evaluated at each iteration as follows:
Once the method has converged, a matrix containing the membership, or degree to which every pixel is similar to each of the classes, is obtained. Finally, the maximum membership is assigned for the FCM segmentation. The result of the FCM algorithm may vary considerably according to the number of selected clusters and the position of the centroids {V}. In order to apply the FCM algorithm to the problem of interest in this research, a proper configuration of both the number of clusters and {V} has been found from an analysis of the image histogram. Previous research [13] showed the correspondence between the different peaks in the histogram and the ROIs throughout the whole CT image set of the pelvic area. Thus, a five-peak model was applied to automatically find the maxima and minima of the histogram and thereby the set of centroids. In order to achieve a proper clustering, including all therapy-relevant regions without loss of information, 15 clusters were considered. Fig. 1(c) shows the result of the FCM clustering applied to the original 2D CT image (Fig. 1(a), axial view of the human pelvic area); Fig. 1(b) displays the histogram of the CT image. The FCM clustering results will be used as fuzzy intensity images by the snake model, as explained below.
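A compact sketch of the FCM iteration described above is given below; it operates on the intensity values only, and the fuzziness exponent, the tolerance and the initial centroids (seeded, for instance, from histogram peaks as discussed) are illustrative values.

```python
import numpy as np

def fcm(values, centroids, m=2.0, tol=1e-4, max_iter=100):
    """Fuzzy C-means on a 1-D array of pixel intensities.

    values:    flattened image intensities, shape (N,)
    centroids: initial cluster centres, e.g. derived from histogram peaks
    Returns the membership matrix U of shape (C, N) and the final centroids.
    """
    v = np.asarray(centroids, dtype=float)
    x = np.asarray(values, dtype=float)
    power = 2.0 / (m - 1.0)
    for _ in range(max_iter):
        d = np.abs(x[None, :] - v[:, None]) + 1e-12        # (C, N) distances
        # Membership update: u_ij = 1 / sum_k (d_ij / d_kj)^(2/(m-1)).
        u = 1.0 / np.sum((d[:, None, :] / d[None, :, :]) ** power, axis=1)
        # Centroid update: weighted mean with weights u_ij^m.
        v_new = (u ** m @ x) / np.sum(u ** m, axis=1)
        if np.max(np.abs(v_new - v)) < tol:                 # stopping rule
            v = v_new
            break
        v = v_new
    return u, v

# Hard segmentation: each pixel is assigned to its maximum-membership cluster,
# e.g. labels = np.argmax(u, axis=0).reshape(image.shape)
```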
2.2 Active Contour Model
The basic idea in the model is to evolve a curve, subject to constraints from a given image, in order to detect an ROI within it. Initially, a curve is set around the ROI; via minimization of an energy functional, it moves normal to itself and stops at the boundary of the ROI. The energy functional is defined as:
The first term represents the internal energy of the spline curve due to the mechanical properties of the contour, stretching and bending. It has been calculated in the same way as in the classical model [5], that is, as the sum of two components, the elasticity and rigidity energies:
The image energy can be expressed as a weighted combination of energy functionals:
The purpose is to attract the snake to lines, edges and terminations depending on the highlighted characteristics of the structure under consideration. This is achieved by adjusting the weights, which provides a wide range of snake behaviours. The three energy functionals are defined as:
where a Gaussian of a given standard deviation is used to smooth the image and the curvature of lines is computed on the smoothed image; the termination functional is thus used to find terminations of line segments and corners. The other two terms are also calculated as in Kass et al. [5]. The image functional is usually defined as the image intensity itself; here we use the fuzzy C-means clustering image instead. The aim is to create a stronger potential by highlighting ROI edges, so that the snake is attracted towards these edges. The last term comes from external constraints and has been defined as a sum of two components:
The functionals are defined as:
where the first functional is the distance from the snake points to the center of gravity of the ROI enclosed by the initial spline curve (the snake region); it thus directs the AC towards a user-defined feature. The second functional improves the snake stability [14] and is given by a linear pressure based on the statistical properties of the snake region, that is:
In addition to the minimum-energy criterion, the FCM cluster membership is also taken into account for snake convergence. Thus, among all the minimum-energy points that are candidates for the final boundary, the preferred ones are those belonging to the corresponding cluster. This cluster is set automatically according to both the grey level and the size of the anatomical structure under consideration.
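The fragment below sketches how the FCM output can drive the snake in the manner just described: the edge potential is computed on the clustered image rather than on the raw intensities, and candidate boundary points are accepted only if they fall inside the target cluster. The function names and the gradient-based edge term are illustrative choices, not the paper's exact formulation.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, sobel

def edge_potential(fcm_image, sigma=1.0):
    """Edge energy computed on the FCM clustering image: minus the squared
    gradient magnitude of the smoothed clustered image, so that boundaries
    between clusters form energy valleys that attract the snake."""
    smooth = gaussian_filter(fcm_image.astype(float), sigma)
    gx, gy = sobel(smooth, axis=1), sobel(smooth, axis=0)
    return -(gx ** 2 + gy ** 2)

def accept_boundary_point(point, labels, target_cluster):
    """Among minimum-energy candidates, prefer points whose pixel belongs
    to the cluster associated with the anatomical structure."""
    r, c = int(round(point[0])), int(round(point[1]))
    return labels[r, c] == target_cluster
```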
2.3 Geometric Active Contour Model
Recently, there has been increasing interest in level set segmentation methods. Level sets, introduced in [7], involve solving the AC minimization of Eq. (4) by the computation of a minimal-distance curve; thereby, the AC evolves following the geometric heat flow equation. Caselles et al. [9] derive the equivalence of geometric AC to the classical AC by first reducing the minimization to the following form:
where a function of the image gradient is used for the stopping criterion. By using the Euler-Lagrange equations, and defining an embedding function of the curve, the following equation for curve/surface evolution is derived:
where an image-dependent balloon force is added to make the contour flow outward and the curvature term smooths the evolving front. Eq. (12) is the level set representation of the modified solution of problem (11). In this research the FCM is used for the stopping criterion, so that:
The advantage of using a level set representation is that the algorithm can handle changes in the topology of the shape as the surface evolves in time, and it is less sensitive to the initialisation. However, there are also many drawbacks in terms of efficiency and convergence [8]. It is also not parameter-free, being dependent on the time step, the spatial step and the narrow band (NB). The results may vary considerably according to the selected parameters, see Fig. 3. The proposed algorithm provides an improved stopping criterion for the active curve evolution, even though equilibrium is not always guaranteed (Fig. 4).
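A minimal sketch of a geodesic-style level set update with an FCM-derived stopping function is shown below; the narrow-band machinery is omitted and the stopping function is built from the gradient of the clustered image, which is one way of realising Eq. (13) but not necessarily the authors' exact choice.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def fcm_stopping(fcm_image, sigma=1.0):
    """Stopping function: close to 0 on cluster boundaries, 1 elsewhere."""
    smooth = gaussian_filter(fcm_image.astype(float), sigma)
    gy, gx = np.gradient(smooth)
    return 1.0 / (1.0 + gx ** 2 + gy ** 2)

def evolve(phi, g, nu=1.0, dt=0.2, n_iter=500):
    """Explicit update phi_t = g * (kappa + nu) * |grad phi| (Eq. (12)-like)."""
    for _ in range(n_iter):
        gy, gx = np.gradient(phi)
        norm = np.sqrt(gx ** 2 + gy ** 2) + 1e-8
        # Curvature kappa = div(grad phi / |grad phi|).
        dgy_dy, _ = np.gradient(gy / norm)
        _, dgx_dx = np.gradient(gx / norm)
        kappa = dgx_dx + dgy_dy
        phi = phi + dt * g * (kappa + nu) * norm
    return phi
```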
3 Experimental Results
Although quantitative evaluation of medical segmentation algorithms is an important step towards establishing the validity and clinical applicability of an algorithm, few researchers subject their algorithms to rigorous testing. The problems often associated with such evaluation are: lack of ground truth (GT), difficulty in defining a metric, and tedious data collection [15,16]. This research has approached the evaluation by analysing: accuracy (validity), efficiency (viability) and precision (reliability).
3.1 Accuracy and Efficiency Evaluation
A real image database comprising different CT studies has been used for evaluation. The result of segmenting the bladder in one of the study image sets is shown in Fig. 2. The initial spline curve consists of 32 points set by the user around the bladder on one of the images of the study, usually the middle one, and it is the same for the whole set (Fig. 2(a)). The parameters of the classical snake model have been adjusted according to the image view, but they are the same in the fuzzy-snake model for the whole set. This capability is an important advantage, since the tuning of parameters is usually an undesirable task in medical applications. Thus, for both models,
fixed parameter values were used for the classical snake and for the fuzzy-snake model. The results may be compared to those of the classical snake model (Fig. 2(b)). It is worth mentioning how, in view 4 of Fig. 2(b), the fuzzy-snake model has driven the initial curve through a bony area, converging closer to the bladder contour. The method may also be compared against the level set model (Fig. 3) and against the level set under a similar FCM framework (Fig. 4). The results show better performance for the fuzzy-snake model in terms of efficiency and accuracy.
Fig. 1. (a) Original CT image. (b) Histogram. (c) FCM clustering image.
The efficiency is assessed by means of the computational time required for segmenting each slice. The number of iterations and the computational time, measured on a Pentium 4 at 2.6 GHz, are shown in the figures for the CT images. The fuzzy-snake segmentation time is slightly higher due to the calculation of the FCM clustering. Nevertheless, this time is small compared with actual manual delineation: it is estimated that outlining the Prescribed Target Area (PTA) for each ROI on a data set of 60 slices takes between 17-40 min [4]. The data set (PTA) was delineated twice by 5 clinicians. In Fig. 5(a-b) the fuzzy-snake results may be compared against a typical manual delineation, Fig. 5(c). The multiple clinicians' outlines were then averaged to generate the GT. The procedure used to average the contours is similar to the one described by Chalana et al. [15] and is based on establishing a one-to-one correspondence between the points constituting the different curves to be averaged. The segmentation was assessed by statistical analysis of the ratio of detected pixels and the distance between boundaries [13]. An average of 0.91 ± 0.18 true positive detections for the automatic segmentation against 0.88 ± 0.22 for the manual one was obtained.
3.2 Precision Evaluation
To assess the reliability of the method, the algorithm was tested with CT images corrupted by additive Gaussian noise. Fig. 6(a) shows a noisy CT image and the initial AC superimposed onto it. The fuzzy-snake model is able to find the
Fig. 2. (a) Initial Snake. (b) Final Classical Snake. (c) Final fuzzy-snake contour.
Fig. 3. Level Set Segmentation, {view1, view2, view4 of the sample CT image set}.
Fig. 4. Fuzzy Level Set Segmentation, {view1, view2, view4 of the sample CT set}.
Fig. 5. Computer Segmentation Compared to Manual Delineation.
Fig. 6. (a) CT noise image, (b) Fuzzy-Snake, (c) Fuzzy-Level Set Segmentation.
contours after 52 iterations and 19 seconds (see Fig. 6(b)), while the fuzzy level set shows an unstable equilibrium and converges after 394 iterations and 184 seconds (Fig. 6(c)). Usually, classical ACs fail when noise traps the contour; the FCM is shown to overcome this problem.
4 Conclusion
The usefulness of fuzzy theory and AC models, combined in a fuzzy-snake model, has been investigated for segmenting anatomical structures in CT images for RTP. The segmentation approach is based on an AC evolving constrained to a fuzzy intensity image produced by a fuzzy C-means clustering algorithm. Moreover, the fuzzy reasoning, in addition to statistical information, is also used for final contour convergence. The model has tried to address some of the drawbacks found in traditional snake models. Hence, it has been shown to minimize the sensitivity to the initialisation, to be free of parameters and to guarantee the equilibrium. The method has been qualitatively compared against a level set approach. Further quantitative evaluation has been carried out with suitable statistical techniques in order to test the applicability of the method. Thus, the method has been validated on a database of 77 pelvic CT images of 11 patients by comparing the computer-generated boundaries against those drawn manually. The analysis shows good results, yielding gains in reproducibility, efficiency and time. Nevertheless, there is still room to improve the model and incorporate the advantages of other geometrical active contour models, such as preservation of topology, multi-object segmentation and the ability to reach high-curvature lines. Moreover, further analysis is being performed for all the therapeutic ROIs in the pelvic area. Acknowledgement. This research has been funded thanks to the projects INBIOMED ISCIII-G03/160 and JCCM/PBI-03-017. We would also like to thank the clinicians at Walsgrave Hospital who delineated the CT images.
References
1. Duncan J.S., Ayache N.: Medical Image Analysis: Progress over Two Decades and the Challenges Ahead. IEEE Trans. on PAMI 22 (2000) 85–106
2. Lee C., Chung P., Tsai H.: Identifying Multiple Abdominal Organs from CT Image Series Using a Multimodule Contextual Neural Network and Spatial Fuzzy Rules. IEEE Trans. on Information Technology in Biomedicine 7 (3) (2003) 208–217 3. Purdy J.A.: 3D Treatment Planning and Intensity-Modulated Radiation Therapy. Oncology 13 (1999) 155–168 4. Haas O.: Radiotherapy Treatment Planning. New System Approaches. SpringerVerlag Pb. (1998) 5. Kass M., Witkin A., Terzopoulos D.: Snakes: Active contour models. Int. J. Comput. Vis., 14 (26) (1988) 321–331 6. Yu Z., Bajaj C.: Image Segmentation Using Gradient Vector Diffusion and Region Merging. IEEE Int. Conference on Pattern Recognition (2002) 7. Malladi R., Sethian J. A., Vemuri B. C.: Shape Modeling with Front Propagation: A Level Set Approach. IEEE Trans. on PAMI 17 (1995) 158–175 8. Wang H., Ghosh B.: Geometric Active Deformable Models in Shape Modeling. IEEE Trans. on Image Processing 9 (2) (2000) 302–308 9. Caselles V., Kimmel R., Sapiro G.: Geodesic Active Contours. Int. J. Comput. Vis. 22 (1) (1997) 61–79 10. Ray N., Havlicek J., Acton S. T., Pattichis M.: Active Contour Segmentation Guided by AM-FM Dominant Componente Analysis. IEEE Int. Conference on Image Processing (2001) 78–81 11. B. Solaiman B., Debon R. Pipelier F., Cauvin J.-M., Roux C.: Information Fusion: Application to Data and Model Fusion for Ultrasound Image Segmentation. IEEE Trans. on BioMedical Engineering 46 (10) (1999) 1171–1175 12. Mohamed N.: A Modified Fuzzy C-Means Algorithm for Bias Field Estimation and Segmentation of MRI Data. IEEE Trans. on Medical Imaging 21 (3) (2002) 193–200 13. Bueno G., Fisher M., Burnham K. Haas O.: Automatic segmentation of clinical structures for RTP: Evaluation of a morphological approach. MIAU Int. Conference. U.K. 22 (2001) 73–76 14. Ivins J., Porrill J.: Active Region Models for Segmenting Medical Images. IEEE Trans. on Image Processing (1994) 227–231 15. Chalana V., Linker D.T.: A Multiple Active Contour Model for Cardiac Boundary Detection on Echocardiographic Sequences. IEEE Trans. on Medical Imaging 15 3 (1996) 290–298 16. Udupa J.K., LeBlancb V. R., Schmidt H., Ying Y.: Methodology for Evaluating Image Segmentation Algorithms. Proceed. of SPIE 4684 (2002) 266–277
Topological Active Volumes for Segmentation and Shape Reconstruction of Medical Images N. Barreira and M.G. Penedo Grupo de Visión Artificial y Reconocimiento de Patrones (VARPA) LFCIA, Dep. Computación, Fac. de Informática, Universidade da Coruña {noelia, cipenedo}@dc.fi.udc.es http://varpa.lfcia.org
Abstract. This paper presents a new methodology for automatic 3D segmentation and shape reconstruction of bones from tomographic cross-sections. This methodology uses the Topological Active Volumes model. The model is based on deformable models; it is able to integrate the most representative characteristics of region-based and boundary-based segmentation models, and it also provides information about the topological properties of the inside of detected objects. The model has the ability to perform local topological changes in its structure during the adjustment phase in order to obtain a specific adjustment to the object’s local singularities, find several objects in the scene, and identify and delimit holes in detected structures. Keywords: medical image segmentation, 3D reconstruction, active nets, active volumes.
1 Introduction Volume segmentation is an important task of medical applications for diagnosis and analysis of anatomical data. Computed tomography (CT), magnetic resonance imaging (MRI) and other imaging techniques provide an effective means of non-invasively mapping the anatomy of a subject. This allows scientists to interact with anatomical structures and obtain information about them. The role of medical imaging has expanded beyond the simple visualisation and inspection of anatomical structures so that it has become a tool for surgical planning and simulation, intra-operative navigation, radiotherapy planning, and for tracking the progress of disease. Segmentation of medical images is a difficult problem because of the sheer size of the datasets and the complexity and variability of anatomic organs. Moreover, noise and low contrast of sampled data may cause the boundaries of anatomical structures to be indistinct and disconnected. The aim of any segmentation method is to extract boundary elements belonging to the same structure and integrate these elements into a coherent and consistent model of the structure. There are many approaches to the segmentation problem. Nowadays, two of the most promising approaches to computer-assisted medical image analysis
are the deformable models [1,2,3,4,5] and the level set methods [6,7]. On one hand, deformable models are curves, surfaces or solids defined within an image or volume domain, deformed under the influence of external and internal forces. On the other hand, level sets are numerical techniques designed to track the evolution of isosurfaces, i.e., surfaces of voxels with intensities equal to an isovalue in a 3D volume. Level set models are flexible and can easily represent complex surface shapes, but they are more complicated than parametric ones. In this paper a new three-dimensional deformable model, the Topological Active Volume, is used to perform segmentation and 3D shape reconstruction of medical images. It tries to solve some problems intrinsic to deformable models. First of all, it solves the initialisation problem: in this model the initialisation is always the same and includes the whole image. Second, it integrates edge and region information in the adjustment process in order to take advantage of both approaches. The model also makes it possible to obtain topological information about the inside of the detected objects. Moreover, it has a dynamic behaviour, allowing local topological changes in order to perform accurate adjustments and to find all the objects of interest in the scene. This paper is organised as follows. Section 2 describes the model and the mechanisms that govern its behaviour. Section 3 explains the methodology used in the segmentation and reconstruction process. Section 4 presents several examples of segmentation and reconstruction of medical images, and finally the conclusions are presented in Section 5.
2 Topological Active Volumes (TAV)
The model presented in this paper is an extension of Topological Active Nets [8] to the three-dimensional domain. Its operation is focused on the extraction, modelling and reconstruction of volumetric objects present in the scene. A Topological Active Volume (TAV) is a three-dimensional structure composed of interrelated nodes where the basic repeated structure is a cube (figure 1). Parametrically, a TAV is defined as a function of three parameters over the domain [0,1] × [0,1] × [0,1], mapping each parameter triple to a point of the image volume. The state of the model is governed by an energy function defined as the sum of the internal and the external energy of the TAV. The former controls the shape and the structure of the net; its computation depends on first and second order derivatives, which control contraction and bending, respectively. The internal energy term is defined below.
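A first- and second-order smoothness term consistent with this description — a sketch in assumed notation, with v(r,s,t) denoting the position of the node with parameters (r,s,t) and α, β the coefficients weighting contraction and bending — is:

\[
E_{int}\bigl(v(r,s,t)\bigr) = \alpha\bigl(\lvert v_r\rvert^2 + \lvert v_s\rvert^2 + \lvert v_t\rvert^2\bigr)
 + \beta\bigl(\lvert v_{rr}\rvert^2 + \lvert v_{ss}\rvert^2 + \lvert v_{tt}\rvert^2 + 2\lvert v_{rs}\rvert^2 + 2\lvert v_{rt}\rvert^2 + 2\lvert v_{st}\rvert^2\bigr) .
\]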
where subscripts represent partial derivatives and the two weighting coefficients control the first and second order smoothness of the net. In order to calculate the energy, the parameter domain [0,1] × [0,1] × [0,1] is discretized as a regular grid defined by the internode spacing, and the first and second derivatives are estimated using the finite differences technique in 3D.
Fig. 1. A TAV grid
On the other hand, the external energy represents the characteristics of the scene that guide the adjustment process. As can be seen in figure 1, the model has two types of nodes: internal and external. Each type of node is used to represent different characteristics of the object: the external nodes fit the surface of the object and the internal nodes model its internal topology. The external energy therefore has to be different for the two types of nodes. This fact allows the integration of information based on discontinuities and information based on regions; the former is associated with external nodes and the latter with internal nodes. In the model presented, this energy term is defined as follows.
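A form consistent with the symbols described in the next paragraph — a sketch in assumed notation, with ω and ρ the weights, I the original image, ℵ(r,s,t) the neighbourhood of node (r,s,t) and f the intensity function — is:

\[
E_{ext}\bigl(v(r,s,t)\bigr) = \omega\, f\bigl[I(v(r,s,t))\bigr]
 + \frac{\rho}{\lvert\aleph(r,s,t)\rvert}\sum_{p\,\in\,\aleph(r,s,t)} f\bigl[I(v(p))\bigr] .
\]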
where the two terms are weighted by scalar coefficients, the image intensity is evaluated at the position of the node, the sum runs over the neighbourhood of the node, and f is a function associated with the image intensity, defined differently for the two types of nodes. If the objects to detect are dark and the background is bright, the energy of an internal node will be minimum when it lies on a point with a low grey level, whereas the energy of an external node will be minimum when it lies on a discontinuity and on a light point outside the object. In that situation, the function f takes the form given in equation (4).
where a weighting term combines the maximum intensity values of the original image I and of the gradient image G, the intensity values of the original image and of the gradient image at the position of the node, the mean intensity in a cube around the node, and the distance from the node position to the nearest gradient in the gradient image. Both internal and external energy parameters are domain dependent. Otherwise, if the objects to detect are bright and the background is dark, the energy of an internal node will be minimum when it lies on a point with a high grey level, and the energy of an external node will be minimum when it lies on a discontinuity and on a dark point outside the object. In such a case, the function f takes the form given in equation (5).
where the symbols have the same meaning as in equation 4.
3 Methodology
The adjustment process of the TAV has several stages, as shown in figure 2. The first stage consists of placing the three-dimensional structure in the image. The nodes cover the whole image and they are located in such a way that the distance between two neighbours in each dimension is always the same. This way, the model is able to detect the objects present in the image although they were placed at different positions. The energy minimisation is performed locally using a Greedy algorithm. With this algorithm, the energy value for each node is computed in several positions (the current position and its 26 neighbour positions) at every step of the minimisation process and the best one is chosen as the next position of the node.
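As an illustration of this minimisation step, the following Python sketch performs one greedy pass over the nodes; total_energy is a hypothetical placeholder standing in for the sum of the internal and external energy terms of the model:

```python
import itertools
import numpy as np

# Offsets covering the current position and its 26 neighbours.
OFFSETS = [np.array(d) for d in itertools.product((-1, 0, 1), repeat=3)]

def greedy_step(nodes, total_energy, volume_shape):
    """One pass of local energy minimisation over all TAV nodes.

    nodes        : (n_nodes, 3) integer array of voxel positions
    total_energy : callable(nodes, index, candidate) -> float; placeholder for
                   the internal plus external energy of node `index` at `candidate`
    volume_shape : shape of the image volume, used to keep candidates inside it
    """
    moved = False
    for i in range(len(nodes)):
        best_pos = nodes[i].copy()
        best_e = total_energy(nodes, i, best_pos)
        for off in OFFSETS:
            cand = nodes[i] + off
            if np.any(cand < 0) or np.any(cand >= volume_shape):
                continue  # discard candidate positions outside the volume
            e = total_energy(nodes, i, cand)
            if e < best_e:
                best_pos, best_e = cand, e
        if not np.array_equal(best_pos, nodes[i]):
            nodes[i] = best_pos
            moved = True
    return moved  # False once every node sits at a local energy minimum
```

Repeating such passes until no node moves corresponds to the stable situation described next.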
Fig. 2. Stages in the TAV adjustment process
The process finishes when the TAV reaches a stable situation, i.e., when the energy of each node in the TAV is minimal. Once the mesh reaches a stable situation, the dimensions of the TAV are recalculated in order to adjust the size of the TAV to the size of the object: for example, if the object is wider than it is high, the number of nodes along the x axis will be greater than the number of nodes along the y axis. Once this recalculation is performed, the mesh, which covered the whole image in the beginning, is centred around the object. This yields the same distribution of nodes regardless of the position of the object in the image [9]. After that, the minimisation process is repeated.
Fig. 3. Adjustment of the initial TAV in an artificial image. (a) Result of the energy minimisation process after the readjustment of the TAV. (b) Step in the connection breaking process. (c) Detected objects with the initial TAV.
As figure 3(a) shows, the physical characteristics of the mesh do not enable the perfect adaptation of the nodes to the objects so some kind of topological changes on the TAVs are necessary to achieve a good adjustment. This way, the restrictions of a fixed topology can be avoided. The topological changes consist of the rupture of connections between external nodes wrongly placed in order to obtain a perfect adjustment to the surfaces of the objects [10]. Figure 3(b) shows a step in the connection breaking process. This process allows the generation of external holes in the mesh and the detection of several objects in the image as figure 3(c) shows. In this case, a subTAV is created for each detected object in order to improve the adaptation of the mesh (figure 4(a)). Every subTAV behaves like a TAV and repeats the whole process described above. The dimensions of the new meshes are proportional to the size of the objects so we can obtain a better normalised distribution of internal and external nodes into the object. Finally, the 3D reconstruction of the images is based on the coordinates of the external nodes and it was not used any smooth technique to enhance the results. Figure 4(b) shows the final reconstruction of the objects in the example.
Fig. 4. Adjustment of SubTAVs. (a) Initialisation and adjustment of each subTAV. (b) 3D reconstruction of detected objects.
4 Results
The model has been tested with several sets of CT images. A 3D image was obtained from a set of 2D CT images. In all the examples, the same image was used as external energy for both internal and external nodes and the Sobel filter was employed to obtain the gradient images. The parameters were empirically selected. As a result of equation 5, the external energy of the internal nodes reaches a minimum value in a point with a high level of gray, that is, when the internal node is inside the bone. On the other hand, if an external node is on the edges of the bones but outside them in a point with a low level of gray, the value of its external energy will be minimum. The first example consists of 297 femur CT images with 255 gray levels. The parameters used were and The initial TAV had 25 × 25 × 8 nodes and was readjusted to 18 ×15 × 21 nodes. Figure 5 shows the results of the segmentation process. The second example is a set of 342 tibia and fibula CT images with 255 gray levels. The parameters used were and The initial TAV had 30 × 30 × 8 nodes. In the first readjustment the mesh had 21 × 20 × 19 nodes. Two subTAVs were generated for each bone: the tibia and the fibula were segmented using a 18 × 17 × 23 and a 9 × 7 × 23 nets, respectively. The results of this process are shown in figure 6.
Fig. 5. (a) Some CT slices of the femur used in the segmentation process. (b) Reconstruction of the femur from CT images
Fig. 6. (a) Some CT images of the tibia and fibula used in the segmentation process. (b) Reconstruction of the tibia and fibula from CT images
5 Conclusions
This work presents a new deformable model focused on segmentation and reconstruction of medical images. The model consists of a volumetric structure and has the ability to integrate information based on discontinuities and regions. The model also allows the detection of two or more objects in the image and a good adjustment to the surfaces in each object. This is due to the distinction of two
classes of nodes: internal and external. That distinction allows the assignment of complementary energy terms to each kind of node, which makes it possible for internal and external nodes to act differently in the same situations. The model is fully automatic and does not need an initialisation process like other deformable models. Once the TAV fits the object, the connections between the external nodes allow the definition of the surface of the object and its representation using any reconstruction technique. On the other hand, the internal nodes show the spatial distribution inside the object, which allows a topological analysis of the objects. The model was tested with medical images, obtaining a good adjustment to the objects; only a readjustment of the parameters and an increase in mesh size are needed to obtain a better adaptation of the mesh to the objects. Future work includes the use of new basic structures in the mesh, like triangular pyramids, and the introduction of graphical principles in the nodes’ behaviour to obtain a better representation of the surfaces of the objects. Acknowledgements. This paper has been partly funded by the Xunta de Galicia and the Ministerio de Ciencia y Tecnología through the grant contracts PGIDIT03TIC10503PR and TIC2003-04649-C02-01, respectively.
References 1. X.M. Pardo and P. Radeva. Discriminant snakes for 3D reconstruction in medical images. In ICPR00, volume IV, pages 336–339, 2000. 2. R. Liu, Y. Shang, F. B. Sachse, and O. Dössel. 3D active surface method for segmentation of medical image data: Assessment of different image forces. In Biomedizinische Technik, volume 48-1, pages 28–29, 2003. 3. J. Montagnat and H. Delingette. Globally constrained deformable models for 3D object reconstruction. Signal Processing, 71 (2): 173–186, 1998. 4. M. Ferrant et al. Surface based atlas matching of the brain using deformable surfaces and volumetric finite elements. In MICCAI 2001, 2001. 5. L. Zhukov, I. Guskov J. Bao, J. Wood, and D. Breen. Dynamic deformable models for MRI heart segmentation. In SPIE Medical Imaging 2002, 2002. 6. A. Charnoz, D. Lingrand, and J. Montagnat. A levelset based method for segmenting the heart in 3D+T gated spect images. In FIMH 2003, volume 2674 of LNCS, pages 52–61. Springer-Verlag, June 2003. 7. C.F. Westin, L. M. Lorigo, O. D. Faugeras, W. E. L. Grimson, S. Dawson, A. Norbash, and R. Kikinis. Segmentation by adaptive geodesic active contours. In Proceedings of MICCAI 2000, pages 266–275, 2000. 8. F. M. Ansia, M. G. Penedo, C. Mariño, and A. Mosquera. A new approach to active nets. Pattern Recognition and Image Analysis, 2:76–77, 1999. 9. F. M. Ansia, C. Mariño, M. G. Penedo, M. Penas, and A. Mosquera. Mallas activas topológicas. In Congreso Español de Informática Gráfica, pages 45–58, 2003. 10. N. Barreira and M.G. Penedo. Topological Active Volumes. In Computer Analysis of Images and Patterns, volume 2756 of Lecture Notes in Computer Science, pages 337–344. Springer-Verlag, 2003.
Region of Interest Based Prostate Tissue Characterization Using Least Square Support Vector Machine LS-SVM S.S. Mohamed1, M.M.A. Salama1, M. Kamel1, and K.Rizkalla2 1
University Of Waterloo, 200 University Avenue West, Waterloo, Ontario, Canada N2L3G1 (msalama,smohamed) @hivolt.uwaterloo.ca 2
University Of Western Ontario, 1151 Richmond Street, Suite 2, London, Ontario, Canada N6A 5B8
Abstract. This paper presents a novel algorithm for prostate tissue characterization based on Trans-rectal Ultrasound (TRUS) images. A Gabor multi-resolution technique is designed to automatically identify the Regions of Interest (ROI) in the segmented prostate image. These ROIs are the highly probable cancerous regions in the gland. Furthermore, statistical texture analysis of these regions is carried out by employing the Grey Level Difference Matrix (GLDM), from which a set of features is constructed. The next stage is feature selection, which defines the most salient subset of the constructed features using exhaustive search. The selected feature set is found to be useful for the discrimination between cancerous and non-cancerous tissues. A Least Squares Support Vector Machine (LS-SVM) classifier is then applied to the selected feature set for the purpose of tissue characterization. The obtained results demonstrate excellent tissue characterization.
1 Introduction
Transrectal Ultrasound (TRUS), introduced in 1971, provides information about the size and shape of the prostate. In the late 1970’s and early 1980’s the technology progressed, allowing two clearly distinct zones to be identified within the gland, but was deficient in detecting tumours. In the mid 1980’s, higher frequency transducers were introduced, resulting in higher image resolution and better display of zonal anatomy. Since then, TRUS has become the dominant imaging modality for diagnosis of prostatism, detection and staging of prostate cancer. Computer Aided Diagnosis (CAD) for prostate cancer requires four major steps: segmentation, ROI identification, feature analysis, and classification. The accurate detection of prostate boundaries from ultrasound images (Segmentation) plays an
important role in several applications such as the measurement of prostate gland volume. Much effort has been devoted to the segmentation step, which is now well established [1], [2]. ROI identification consists of highlighting the most probable cancerous regions in the gland, a step that is normally achieved with the help of an expert radiologist. This step is crucial, as studying the whole image would lead to distorted features that do not reflect the medical condition. Feature analysis mainly consists of extracting features from the identified ROIs in the TRUS image. These features could be statistical features, spectral features or model-based features. The features chosen for this work are second order statistical features obtained using the Grey Level Dependence Matrix. The next step in feature analysis is selecting the most salient features among the constructed ones, the feature selection step. Following that is the classification stage, which depends mainly on the quality of the selected features and the precision of the classifier. Tissue typing in the TRUS image attempts to differentiate the cancerous and the non-cancerous regions. This paper is organized as follows: Section 2 covers the ROI identification technique and shows some sample ROI-identified TRUS images. Section 3 explains the statistical features extracted from the ROIs. Section 4 describes the feature selection algorithm used in this work and presents the selected features for this specific application. Section 5 presents the LS-SVM algorithm and the classification results. Section 6 concludes the paper. The TRUS images used in this work were obtained from the University of Western Ontario (UWO) and were acquired with an Aloka 2000 ultrasound machine using a broadband 7 MHz linear transducer and a field of view of approximately 6 cm. A set of 32 radiologist-identified TRUS images was used for this study.
2 ROI Identification
ROI segmentation is a crucial step for prostate cancer diagnosis; earlier, this step was performed with the aid of trained radiologists. With the aim of fully automating prostate cancer diagnosis, there is a great need for an automatic ROI segmentation algorithm. Multi-resolution filtering has proven to be an excellent method for texture investigation in the field of image processing. By processing the image using multi-resolution techniques, it is decomposed into appropriate texture features that can be used to classify the textures accordingly [3]. This method is applied to the segmented TRUS image of the suspected patient in order to identify the ROIs. This is achieved by applying Gabor multi-resolution analysis, which is capable of segmenting the image according to the frequency response of the pixels: pixels that have a similar response are assigned to the same cluster. This process segments the prostate image into several regions, which are the regions of interest. The Gabor function was chosen for its high localization in both the spatial-frequency and the spatial domains. The Gabor function in the spatial domain is a Gaussian-modulated sinusoid. For a 2-D Gaussian curve with given spreads in the x and y directions and a given modulating frequency, the real impulse response of the filter is given by [4]:
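The standard real impulse response from [4], restated here as a sketch in assumed notation (σ_x, σ_y the Gaussian spreads, u_0 the modulating frequency), is:

\[
h(x,y) = \exp\!\left\{-\frac{1}{2}\left[\frac{x^2}{\sigma_x^2} + \frac{y^2}{\sigma_y^2}\right]\right\}\cos\bigl(2\pi u_0 x\bigr) .
\]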
In the spatial-frequency domain, the Gabor function becomes two shifted Gaussians located at the modulating frequency. The 2-D frequency response of the filter is given by:
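The corresponding frequency response — the two shifted Gaussians mentioned above — can be written, up to a constant factor, in the same assumed notation as:

\[
H(u,v) \;\propto\; \exp\!\left\{-2\pi^2\!\left[\sigma_x^2 (u-u_0)^2 + \sigma_y^2 v^2\right]\right\}
 + \exp\!\left\{-2\pi^2\!\left[\sigma_x^2 (u+u_0)^2 + \sigma_y^2 v^2\right]\right\} .
\]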
This algorithm was applied to the available TRUS images; a sample of the segmented images, as well as the corresponding ROI-identified images, is shown in Figure 1. More details about the filter design are explained in [5].
3 Feature Construction
Some important texture features can be extracted from the Grey Level Dependence Matrix (GLDM). This second order statistical approach has been applied to several ultrasound image analyses, where it has been found effective in a number of applications such as fetal lung maturity determination [6]. GLDMs are matrices whose elements are the probabilities of finding a pixel with grey-tone i at a given distance and angle from a pixel with grey-tone j. The grey-tones within a given image segment must be quantized to a fixed number of grey levels. Each pixel is considered as having eight nearest neighbors connected to it, except at the periphery. The neighbors can be grouped into the four categories shown in Figure 2. The texture information is contained in the probability density functions, or GLDMs, P(i, j). In this work a set of four features is constructed from this matrix (Energy, Entropy, Contrast and Homogeneity). Contrast is “a measure of local image variation”.
Fig. 1. Two different TRUS segmented images and the corresponding ROIs
Fig. 2. GLDM matrix demonstration
Entropy “an inverse measure of homogeneity i.e. measure of information content”
Energy
Homogeneity
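For reference, the standard definitions of these four measures over P(i, j) are sketched below; the exact normalisations used in the paper may differ:

\[
\text{Contrast} = \sum_{i,j}(i-j)^2\,P(i,j), \qquad
\text{Entropy} = -\sum_{i,j} P(i,j)\,\log P(i,j),
\]
\[
\text{Energy} = \sum_{i,j} P(i,j)^2, \qquad
\text{Homogeneity} = \sum_{i,j}\frac{P(i,j)}{1+\lvert i-j\rvert} .
\]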
4 Feature Selection
The principle of feature selection is to take a set of candidate features and select a subset that retains most of the information needed for pattern classification [7]. In some cases it is possible to derive a subset of features that forfeits none of the information needed for classification. Such a subset of features is referred to as an optimal set and results in no increase in the minimum probability of error when a decision rule is applied in both the observation and the subset space. Feature selection is used to select a subset of features from a given set of p features, without significant degradation in the performance of the recognition system. Exhaustive search is used in this work to guarantee the globally optimal feature subset. The feature selection algorithm applied in this paper is a classifier-dependent method: all possible feature subsets are generated and the classifier performance is tested for each subset; finally, the best discriminatory feature subset is chosen (see the sketch below). Using the feature set of four features (energy, entropy, contrast and homogeneity) as input to the algorithm, it was found that the best discrimination occurred when using a feature subset composed of contrast and homogeneity.
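A minimal Python sketch of this classifier-dependent exhaustive search; the score argument is a placeholder for whatever classifier-based criterion (e.g. cross-validated accuracy) is used:

```python
from itertools import combinations

def exhaustive_feature_selection(X, y, feature_names, score):
    """Evaluate every non-empty feature subset and return the best one.

    X             : (n_samples, n_features) array of feature values
    y             : class labels (cancerous / non-cancerous)
    feature_names : e.g. ["energy", "entropy", "contrast", "homogeneity"]
    score         : callable(X_subset, y) -> float (placeholder criterion)
    """
    best_subset, best_score = None, float("-inf")
    n = len(feature_names)
    for k in range(1, n + 1):
        for idx in combinations(range(n), k):
            s = score(X[:, list(idx)], y)
            if s > best_score:
                best_subset, best_score = [feature_names[i] for i in idx], s
    return best_subset, best_score
```

With only four candidate features there are 15 subsets, so the exhaustive search is cheap and guarantees a globally optimal subset.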
5 Classification
Considering the output of the feature selection algorithm, only the contrast and homogeneity features are used for the classification stage.
5.1 Support Vector Machines
Support Vector Machines are found to be an influential methodology for solving nonlinear classification problems [8]. SVMs have been introduced within the framework of statistical learning theory and structural risk minimization; the SVM depends mainly on pre-processing the data to represent patterns in a higher dimensionality space, usually much higher than the original feature space. This is achieved with a suitable non-linear mapping to a sufficiently high dimension [8]. Data from two classes are always separated by a hyper-plane. Assume each pattern has been transformed to an augmented feature vector y; for each of the n patterns, let a label z_k = ±1 be assigned according to whether the pattern is in the first or the second class. A linear discriminant in the augmented space is given by g(y) = a^t y.
Both the weight vector a and the transformed pattern vector y are augmented. A separating hyper-plane thus ensures a margin, i.e., a positive distance of every pattern from the plane. The SVM is trained such that the separating plane has the largest margin; it is expected that the larger the margin, the better the classification. It has been proven in [8] that the distance from any hyper-plane to a transformed pattern y is g(y)/||a||, and assuming that a margin b exists, then z_k g(y_k)/||a|| ≥ b for every pattern.
The goal is to find a weight vector a that maximizes b, subject to a normalization constraint; the support vectors are the training samples that define the optimal separating hyper-plane and are the most difficult patterns to classify. In SVM training one solves a convex optimization problem, typically by quadratic programming. LS-SVM was introduced as a reformulation of the ordinary SVM: the cost function is a regularized least squares function with equality constraints, leading to a linear Karush-Kuhn-Tucker system [9]. LS-SVM was used as the classifier in this work because of its ability to deal with noisy and non-linearly separable data. The features used for this study were normalized before being fed to the classifier. The classification results were very good: the training set is composed of 70 labelled regions, while the test set is composed of 16 regions, and the overall accuracy obtained was 87.5%. The confusion matrix is shown in Table 1. A well-known measure of classifier accuracy is the ROC curve, which is shown for the LS-SVM in Figure 3.
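A compact Python sketch of an LS-SVM classifier along the lines of [9], trained by solving the linear Karush-Kuhn-Tucker system; the RBF kernel and the parameter values shown are illustrative assumptions, not the settings used in the paper:

```python
import numpy as np

def rbf_kernel(A, B, sigma=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def lssvm_train(X, y, gamma=10.0, sigma=1.0):
    """Solve [[0, y^T], [y, Omega + I/gamma]] [b; alpha] = [0; 1], with y in {-1, +1}."""
    n = len(y)
    Omega = np.outer(y, y) * rbf_kernel(X, X, sigma)
    A = np.zeros((n + 1, n + 1))
    A[0, 1:] = y
    A[1:, 0] = y
    A[1:, 1:] = Omega + np.eye(n) / gamma
    rhs = np.concatenate(([0.0], np.ones(n)))
    sol = np.linalg.solve(A, rhs)
    return sol[1:], sol[0]          # alpha, bias

def lssvm_predict(X_train, y_train, alpha, b, X_test, sigma=1.0):
    K = rbf_kernel(X_test, X_train, sigma)
    return np.sign(K @ (alpha * y_train) + b)
```

For the normalized contrast and homogeneity features, training on the 70 labelled regions and predicting the 16 test regions would amount to one call to lssvm_train followed by one call to lssvm_predict.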
Fig. 3. Receiver Operating Characteristic curve
6 Conclusion
A novel algorithm was implemented for accurate and automated prostate cancer diagnosis using TRUS images. The ROIs were identified from the segmented prostate TRUS images using Gabor multi-resolution analysis; since the Gabor function is highly localized in both the spatial and spatial-frequency domains, it leads to accurately identified regions. Second order statistical texture features were constructed from these automatically segmented regions of interest using the GLDM. Furthermore, a feature subset representing the most salient and uncorrelated features was generated using exhaustive search in order to guarantee a globally optimal solution. Finally, these features were used for tissue typing with the LS-SVM algorithm, which has proven successful in dealing with noisy data. The obtained results revealed a high accuracy of 87.5% in spite of the limited available data sets; better accuracy is expected when more data sets become available.
References 1.
Dinggang Shen; Yiqiang Zhan; Davatzikos, C.;Medical Imaging, “Segmentation of prostate boundaries from ultrasound images using statistical shape model” IEEE Transactions on, Volume: 22 , Issue: 4 , April 2003 Pages:539 – 551 2. Lixin Gong; Pathak, S.D.; Haynor, D.R.; Cho, P.S.; Yongmin Kim; “Parametric shape modeling using deformable superellipses for prostate segmentation” Medical Imaging, IEEE Transactions on , Volume: 23, Issue: 3, March 2004 Pages:340 – 349 3. David A. Clausi, M.Ed Jernigan, “Designing Gabor filters for optimal texture separability” Pattern Recognition 33 (2000) 1835-1849. 4. A.C. Bovik, M. Clark, W.S. Geisler, “Multichannel texture analysis using localized spatial filters” IEEE Trans. Pattern Anal. Machine Intell. 12 (1) (1990) 55-73. 5. S.S. Mohamed, E.F. El-Saadany, T.K. Abdel-Galil,J. Shen, and M.M. Salama, A.Fenster, D.B. Downey, and K. Rizkalla, “Region of Interest Identification in TRUS Images of the Prostate Based on Gabor Filter” IEEE Midwest Symposium on circuits and systems, 2003. 6. Bhanu Prakash, K.N.; Ramakrishnan, A.G.; Suresh, S.; Chow, T.W.P.; “Fetal lung maturity analysis using ultrasound image features” Information Technology in Biomedicine, IEEE Transactions on, Volume: 6 , Issue: 1 , March 2002 Pages:38 – 45 7. R.Duda, P.Hart,D.Stork, “Pattern Classification”, John Wiley and Sons.2001 8. C. Junli, J. Licheng “Classification mechanisms for SVM”, Proceedings of ICSP2000. 9. J.A.K. Suykens, J .Vandewalle, “Least Squares Support Vector Machine classifiers”, Neural Processing Letters Volume: 9, Issue: 3, June 1999, pp. 293-30. 10. K. Pelckmans, J. A. K. Suykens, T. Van Gestel, J. De Brabanter, L. Lukas, B. Hamers, B. De Moor, J. Vandewalle, “ LS-SVMlab Toolbox User’s Guide”, Pattern recognition letters 24 (2003) 659-675
Ribcage Boundary Delineation in Chest X-ray Images Carlos Vinhais1,2 and Aurélio Campilho1,3 1
INEB - Instituto de Engenharia Biomédica, Laboratório de Sinal e Imagem Biomédica, Campus da FEUP, Rua Roberto Frias, s/n, 4200-465 Porto, Portugal 2 ISEP - Instituto Superior de Engenharia do Porto, Departamento de Física, Porto, Portugal
[email protected] 3
Universidade do Porto, Faculdade de Engenharia, Departamento de Engenharia Electrotécnica e Computadores, Porto, Portugal
[email protected]
Abstract. We propose a method for segmenting the ribcage boundary of digital postero-anterior chest X-ray images. The segmentation is achieved by first defining image landmarks: the center of the ribcage and, using a polar transformation from this point, two initial points belonging to the ribcage. A bank of Gabor filters (in analogy with the simple cells present in the human visual cortex) is used to obtain an orientation-based edge-enhanced image. In this enhanced image, edge following, starting from the previously determined landmarks, is performed to delineate the left and right sections of the ribcage. The complete segmentation is then accomplished by connecting these sections with the top section of the ribcage, obtained by means of spline interpolation. Keywords: Ribcage boundary segmentation, polar image transform, probabilistic genetic algorithm, Gabor filters, edge following.
1 Introduction
The automatic delineation of the ribcage boundary of digital X-ray chest images provides useful information required for computer-aided diagnosis (CAD) schemes. In chest radiography, CAD schemes have been developed for automated detection of abnormalities, such as pulmonary nodules [1]-[3], pneumothorax [4] or cardiomegaly [5]. The thoracic cage boundary delimits the area of search of the ribs and represents a convenient reference frame for locating structures of interest in the chest image. An overview of the literature on lung field segmentation, rib detection and methods for selection of nodule candidates can be found in [6]. Several methods for automatic segmentation of the ribcage boundary have been suggested, based on the approach of edge detection from derivatives [7][8], or dynamic programming and curve fitting [9]. In this paper, we present a novel method for segmenting the ribcage boundary, based on the polar image transform and image landmarks extraction, as
explained in Sect. 2 and 3. Using such feature points, the delineation of the right, left and top sections of the ribcage, based on edge following and spline interpolation, is covered in Sect. 4. Finally, we present some results and draw conclusions.
2 Determination of Center of Ribcage
The chest X-ray images used to test the method presented herein are 512 × 512 gray-scale images, represented by an intensity function over the image coordinate system. The delineation of the ribcage edges of a chest X-ray image is based on a reference point that represents approximately the center of the ribcage in the image. The center C is determined by first defining two points. These landmarks, shown in Fig. 1(a), correspond to the locations of the minima of two intensity profiles: for the first point, we consider the intensity profile from the center of the image O to the upper-left corner of the image and, for the second, the intensity profile from O to the upper-right corner of the image. The positions of the two points are defined by their distances (radial positions) from the center O of the image. The center C of the ribcage is then defined by the position of the mean of the cumulative sum of the intensity profile along the segment joining the two points. Fig. 1(b) shows the determined center of the ribcage for one of the X-ray images of the database.
Fig. 1. (a) Definition of the center C (cross) of the ribcage, from minima and (circles) of intensity profiles (dotted lines); (b) X-ray intensity image, with the center C of ribcage (cross); (c) Polar transform, of image (b) with origin C.
The determination of the positions of the two landmark points can be seen as an optimization problem solved with a genetic algorithm (GA) [10]. We decided to use a GA in its probabilistic form (PGA) [11][12], where the search for the optimal
solution is based on an initial population of candidate solutions, randomly selected from a uniform distribution over a range of pixel positions. A few generations (iterations) of the PGA are enough to reach convergence.
3 Determination of Ribcage Starting Points
The delineation of the ribcage edges, explained in Sect. 4, will start from two points that should belong, in the supra-clavicular region, to the right and left ribcage sections, respectively.
3.1 Determination of Angular Position of Ribcage Starting Points
We define the starting points in polar coordinates with respect to the center C of the ribcage. The polar transform of the chest intensity image with origin C maps each image point to a radial coordinate, the distance (in pixels) from C of the corresponding point in the original image, and an angular coordinate, the angular distance of the corresponding point from a reference axis; a sketch of the transform is given below. The polar transform, of size 360 × 512, of an X-ray image is shown in Fig. 1(c). The resolution is 1° and 1 pixel in the angular (vertical, top-down) and radial (horizontal, left-right) axes, respectively. The angular coordinates of the starting points of the ribcage sections are determined by locating the minima, shown in Fig. 2(a) with arrows, of the intensity projection of the polar image onto the angular axis. The search for these two values is solved with another PGA. In this minimization problem, the solutions of the initial population of the PGA are randomly selected from normal distributions, with a given standard deviation and with mean values chosen separately for the left and the right starting points.
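A sketch of such a polar transform in assumed notation, with (x_C, y_C) the coordinates of C, ρ the radial coordinate and θ the angular coordinate:

\[
I_P(\theta, \rho) = I\bigl(x_C + \rho\cos\theta,\; y_C + \rho\sin\theta\bigr),
\qquad \theta \in [0^\circ, 360^\circ),\;\; \rho \ge 0 .
\]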
3.2 Determination of Radial Position of Ribcage Starting Points
The distances from C of the two starting points are determined by considering the corresponding radial intensity profiles. These two profiles are superimposed in Fig. 2(b). We first look for the radial coordinates of the minima locations of smoothed versions of the two intensity profiles. We then determine the starting-point distances with the condition:
Fig. 2. (a) Intensity projection of the polar image of Fig. 1(c), onto angular axis; (b) Radial intensity profiles of for the angles indicated by the arrows in (a); (c) Result of starting points (circles) determination for X-ray image of Fig. 1(b).
In Eq. 3, we assumed the width of the image. The search for the minima locations is again performed with a PGA, this time starting with an initial population of solutions randomly selected from normal distributions, with a given mean value (in pixels) and a standard deviation of 7 pixels. Fig. 2(c) shows the determined starting points for the X-ray image of Fig. 1(b). The use of a PGA for solving this optimization problem allows constraints to be imposed between the two distances and makes them evolve together. Because of the reflectional symmetry present in X-ray chest images, the two radial intensity profiles should be approximately coincident, as shown in Fig. 2(b).
4 Delineation of Ribcage Boundary
Once the starting points have been determined, the X-ray image has to be processed to produce a ribcage-edge-enhanced image, in which the delineation of the left and right ribcage sections will take place.
4.1 Ribcage Edges Enhancement
The chest X-ray image is filtered with a family of two-dimensional Gabor filters, for ribcage edges enhancement. These filters have been used in several computer vision tasks, including image enhancement [13] and edge detection [14], in analogy with the processing of stimuli by cortical simple cells present in the human visual system.
A receptive field function of such a cell, centered in the origin, can be represented by a linear Gabor filter:
where the angle parameter (see Eq. 5) determines the orientation of the filter (the preferred orientation of the simple cell), and the spatial aspect ratio is a constant that determines the ellipticity of the receptive field. The value of the ratio between the Gaussian spread and the wavelength of the cosine factor is considered, as in [15], to be 0.56; in this paper we consider a fixed value of this wavelength, in pixels. The value of the standard deviation of the Gaussian factor is imposed by the width of the ribcage edges we want to enhance. The receptive field is sampled over a square window, resulting in a 61 × 61 kernel. Finally, imposing a suitable phase offset, a symmetric filter is obtained, as shown in Fig. 3(a).
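A receptive-field expression consistent with this description is the one used in [15]; as a sketch in assumed notation, with θ the preferred orientation, γ the spatial aspect ratio, σ the Gaussian spread, λ the wavelength of the cosine factor and φ the phase offset (φ = 0 yielding the symmetric filter):

\[
g_{\theta}(x,y) = \exp\!\left(-\frac{x'^2 + \gamma^2 y'^2}{2\sigma^2}\right)\cos\!\left(2\pi\frac{x'}{\lambda} + \varphi\right),
\qquad
x' = x\cos\theta + y\sin\theta,\;\; y' = -x\sin\theta + y\cos\theta .
\]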
Fig. 3. (a) Gabor filter for pixels, (b) Spatial response of cortical simple cells, with preferred orientation to X-ray image input of Fig. 1(b). (c) Ribcage edges enhanced image after performing a winner-takes-all orientation competition
We assume that the positive spatial response of a simple cell to the X-ray input intensity distribution with a receptive field (selective) orientation is given by:
where the filtered image is computed by convolution:
The columnar organization of simple (and complex) cells in the human visual cortex [16] suggests filtering the X-ray input image with a bank of Gabor filters, defined by Eq. 4, with the same value of the Gaussian spread but a fixed number of different, equally spaced, preferred orientations. The spatial response of simple cells with one preferred orientation is shown in Fig. 3(b). The ribcage edges enhanced image shown in Fig. 3(c) is then obtained by performing a winner-takes-all orientation competition approach:
The enhanced image is considered as the representation of the maximum of the cortical cell activities at each point. This image will now be used for the delineation of the ribcage boundary.
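A Python sketch of the filtering stage just described — a bank of Gabor kernels at equally spaced orientations, half-wave rectified responses and a winner-takes-all maximum over orientations. The kernel follows the receptive-field sketch given after Eq. 4; the spread, aspect ratio and number of orientations used here are assumptions, not values taken from the paper:

```python
import numpy as np
from scipy.ndimage import convolve

def gabor_kernel(theta, sigma=10.0, aspect=0.5, sigma_over_lambda=0.56, size=61):
    lam = sigma / sigma_over_lambda           # wavelength of the cosine factor
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xp = x * np.cos(theta) + y * np.sin(theta)
    yp = -x * np.sin(theta) + y * np.cos(theta)
    return np.exp(-(xp ** 2 + (aspect * yp) ** 2) / (2.0 * sigma ** 2)) \
        * np.cos(2.0 * np.pi * xp / lam)

def enhance_ribcage_edges(image, n_orientations=12, **kwargs):
    """Winner-takes-all over half-wave rectified Gabor responses."""
    responses = []
    for i in range(n_orientations):
        theta = i * np.pi / n_orientations
        r = convolve(image.astype(float), gabor_kernel(theta, **kwargs))
        responses.append(np.maximum(r, 0.0))     # keep only the positive response
    return np.max(np.stack(responses), axis=0)   # maximum cell activity per pixel
```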
4.2 Delineation of Right and Left Ribcage Edges
The left and right sections of the ribcage are delineated on the enhanced image using an edge following technique that is now described. Let us consider a point that belongs to the ribcage. Using Eq. 1, a polar transform of a region of interest (ROI) centered in this point is performed. We impose a square ROI (Fig. 4(a)) of width 61 pixels, the same size as the Gabor filters used in Sect. 4.1 to enhance the input image. The size of the polar image is then 360 × 31, if we consider a resolution of 1° and 1 pixel in the angular (vertical, top-down) and radial (horizontal, left-right) axes of the image, respectively (see Fig. 4(b)).
Fig. 4. (a) ROI centered in the starting point of the right section of ribcage of Xray of Fig. 1(b); (b) Polar transform of (a); (c) Intensity projection of (b) onto angular axis.
The location of the peak of the projection of the polar ROI onto the angular axis (see Fig. 4(c)) indicates the preferred orientation to follow for determining the next point of the ribcage section. The radial step, equal to the standard deviation of the Gabor filters, is a constant that determines the distance between consecutive followed points. The procedure is repeated, no more than 50 times, starting from each of the two starting points with the appropriate initial orientation, for the delineation of the right and left ribcage sections, respectively. In order to have some control over the edge follower, the polar ROI is previously weighted by a separable Gaussian function defined by:
where the Gaussian is centered on the preferred orientation determined for the previous point. Its parameters control the search area around the current point and the curvature of the followed curve. The choice of these parameters is not critical, but requires some experimental adjustment.
4.3 Delineation of Top Section of Ribcage
The delineation of the top section requires a different approach because of the complicated structures present in the neck. We decided to consider the first three points, sampled every 25 points, from the left and right ribcage starting points. A spline interpolation is performed with these points, and the top ribcage section is represented by the interpolated points between the two starting points.
Fig. 5. Complete ribcage boundary delineation results for three X-ray images of the database.
Fig. 5 shows the results of the delineation of the complete ribcage boundary, obtained by connecting the left, top and right ribcage edges, for three X-ray images of the database.
5 Conclusions
The delineation of the ribcage boundary is strongly dependent on the two starting points. The signature of the projection of the polar image, with origin C, the center of the ribcage, provides enough information about their angular location. For defining their radial coordinates, we chose to use a PGA to locate the minima of the radial intensity profiles (see Sect. 3.2). The parameters of the PGA are easily adjusted to the initial candidate positions, corresponding to the mean distance from the center C to the upper lobe of the lungs. Because of the symmetry exhibited by X-ray chest images, the PGA is suitable for implementing a simultaneous search for the two minima. The robustness of the method for extracting these three feature points makes them attractive image landmarks, required by some model-based schemes, e.g. active shape models, to work properly in the initialization of the model. The value of the standard deviation of the Gabor filters used to enhance orientation edges in the X-ray image is imposed by the width of the ribcage edges we want to segment. The parameters of the edge following technique used to delineate the left and right sections, the step and the radius of the search area at each point, are not critical parameters and were fixed to the same value. Experiments were performed with a variable step, but no significant improvement was achieved.
References 1. M. Carreira and D. Cabello, “Computer-Aided Diagnoses: Automatic Detection of Lung Nodules”, Med. Phys., 25 (10), pp. 1998-2006, 1998. 2. X. Xu, “Development of an Improved CAD Scheme for Automated Detection of Lung Nodules in Digital Chest Images”, Med. Phys., 25 (9), pp. 1395-1403, 1997. 3. S. B. Lo, “Artificial Convolution Neural Network Techniques and Applications for Lung Nodule Detection”, IEEE Trans. on Medical Imaging, 14 (4), pp. 711-718, 1995. 4. S. Sanada, K. Doi and H. MacMahon, “Image feature analysis and computer-aided diagnosis in digital radiography: Automated detection of pneumothorax in chest images”, Med. Phys., 19, pp. 1153-1160, 1992. 5. N. Nakaromi, K. Doi, H. MacMahon, Y. Sasaki and S. M. Montner, “Effect on heart-size parameters computed from digital chest radiographs on detection of cardiomegaly: Potencial usefulness for computer-aided diagnosis”, Inv. Radiology, 26, pp. 546-550, 1991. 6. B. van Ginneken, B. H. Romeny and M. A. Viergever, “Computer-aided Diagnosis in Chest Radiography: A survey”, IEEE Trans. on Medical Imaging, 20 (12), pp. 1228-1241, 2001. 7. X. Xu and K. Doi, “Image feature analysis for computer-aided diagnosis: Accurate determination of ribcage boundary in chest radiographs”, Med. Phys., 22 (5), pp. 617-626, 1995. 8. N. Nakaromi, K. Doi, V. Sabeti and H. MacMahon, “Image feature analysis and computer-aided diagnosis in digital radiography: Automated analysis of sizes of heart and lung in chest images”, Med. Phys., 17, pp. 342-35, 1990.
9. Z. Yue, A. Goshtasby and L. V. Ackerman, “Automatic Detection of Rib Borders in Chest Radiographs”, IEEE Trans. on Medical Imaging, 14 (3), pp. 525-536, 1995. 10. D. E. Goldberg, Genetic Algorithms in Search, Optimization & Machine Learning. Reading, MA: Addison Wiley, 1989. - ISBN 0-201-15767-5. 11. C. Vinhais and A. Campilho, “Optimal Detection of Symmetry Axis in Digital Chest X-ray Images”, 1st Iberian Conference on Pattern Recognition and Image Analysis - IbPRIA 2003, Lecture Notes in Computer Science, Vol. 2652. SpringerVerlag, Berlin Heidelberg New York, pp. 1082-1089, 2003. 12. Y. Gofman and N. Kiryati, “Detecting Symmetry in Grey Level Images: the Global Optimization Approach”, International Journal of Computer Vision (IJCV), 29, pp. 29-45, 1998. 13. G. Cristóbal and R. Navarro, “Space and frequency variant image enhancement based on Gabor representation”, Pattern Recognit. Lett., 15, pp. 273-277, 1994. 14. R. Mehrotra, K. R. Namuduri, and N. Ranganathan, “Gabor filter-based edge detection”, Pattern Recognit., 25 (12), pp. 1479-1494, 1992. 15. N. Petkov and P. Kruizinga, “Computational models of visual neurons specialised in the detection of periodic and aperiodic oriented visual stimuli: bar and grating cells”, Biol. Cybern., 76, pp. 83-96, 1997. 16. M. B. Carpenter, Core Text of Neuroanatomy, 4th ed., Willams & Wilkins, Baltimore, 1991 - ISBN 0-683-01457-9.
A Level-Set Based Volumetric CT Segmentation Technique: A Case Study with Pulmonary Air Bubbles José Silvestre Silva1,2, Beatriz Sousa Santos1,3, Augusto Silva1,3, and Joaquim Madeira1,3 1
Departamento de Electrónica e Telecomunicações, Universidade de Aveiro, Campo Universitário de Santiago, P-3810-193 Aveiro, Portugal {bss, asilva, jmadeira}@det.ua.pt 2
Departamento de Física, Faculdade de Ciências e Tecnologia, Universidade de Coimbra, Rua Larga, P-3004 516 Coimbra, Portugal
[email protected] 3
Instituto de Engenharia Electrónica e Telemática de Aveiro Campo Universitário de Santiago,P-3810-193 Aveiro, Portugal
Abstract. The identification of pulmonary air bubbles plays a significant role for medical diagnosis of pulmonary pathologies. A method to segment these abnormal pulmonary regions on volumetric data, using a model deforming towards the objects of interest, is presented. We propose a variant to the well known level-set method that keeps the level-set function moving along desired directions, with an improved stopping function that proved to be successful, even for large time steps. A region seeking approach is used instead of the traditional edge seeking. Our method is stable, robust, and automatically handles changes in surface topology during the deformation. Experimental results, for 2D and 3D high resolution computed tomography images, demonstrate its performance.
1 Introduction The detection of structures in the human body is difficult, due to their large variability in shape and complexity. Computed tomography (CT) is a medical imaging tool that allows volumetric data acquisition. With the high resolution computed tomography (HRCT) technique and, more recently, with multi-slice spiral CT, it is possible to obtain very thin slices of the thoracic region, having high resolution and contrast between lungs and near structures [1,2]. CT, in particular HRCT, has been one of the most used tools for pulmonary air bubble analysis. With HRCT, it is possible to observe small air bubbles, characterize them (number, dimensions, location), plan their treatment with or without surgery, and monitor their evolution (increase in size, detection of new lesions, or additional abnormalities) [3-5].
An important goal in image processing is to detect object shapes in 2D or 3D. One way to achieve this is to detect the associated boundaries using model based techniques. Such techniques can be, for instance, the classical snakes [6, 7] or 3D deformable surfaces [8, 9], based on deforming an initial contour or surface towards the boundary of the object to be detected. Level-set models, also known as geometric deformable models, have tremendous impact in medical imaging due to topology adaptability and fast shape detection, and provide an alternative solution that overcomes the limitations of parametric deformable models. Level-set models are based on curve evolution theory [10]: curves and surfaces evolve using only geometric measures, resulting in an evolution that is independent of the parameterization. The evolving curves and surfaces are represented implicitly as the level-set of a higher-dimensional function [11, 12]. The level-set method offers the advantage of easy initialization, computational efficiency, ability to capture sharp vertices. The convergence to the final result is relatively independent of the initialization. In the present work, we consider a closed, nonintersecting initial hypersurface placed outside the thoracic region. This hypersurface is then allowed to flow along its gradient field with a speed proportional to the hypersurface curvature and relevant image properties. We extend the known algorithms of geometric deformable models with an improved stopping criterion. In order to characterize pulmonary air bubbles, we describe a new approach based on their physical characteristics, using a different initialization criterion and a new stopping function based on region seeking. The efficiency of the method is demonstrated with experiments on real images.
2 The Level-Set Method The level-set model, initially proposed by Osher [10] and first applied to medical images, independently, by Malladi [11] and Caselles [12], is based on an evolution equation (1) for a level-set function φ, in which c is a constant, K is the curvature of φ, given by equation (2), and P is the stopping function of equation (3), responsible for pushing the model towards image boundaries; P is computed from the image I after smoothing with a Gaussian filter G_σ of standard deviation σ.
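In the classical formulation of [10]-[12] these take the following standard forms, restated here as a reference sketch in the assumed notation above (the 2-D curvature expression is shown; the 3-D case is analogous):

\[
\frac{\partial \phi}{\partial t} = P\,(c + K)\,\lvert\nabla\phi\rvert,
\qquad
K = \nabla\cdot\frac{\nabla\phi}{\lvert\nabla\phi\rvert}
  = \frac{\phi_{xx}\phi_y^2 - 2\phi_x\phi_y\phi_{xy} + \phi_{yy}\phi_x^2}{\bigl(\phi_x^2+\phi_y^2\bigr)^{3/2}},
\qquad
P = g\bigl(\lvert\nabla (G_\sigma * I)\rvert\bigr),
\]

where g is a decreasing function of the smoothed image gradient (two common choices for g are discussed in Sect. 2.1).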
The stopping function P in (3) only slows the evolution; thus, it is possible that the level-set does not stop at object boundaries and continues moving. To overcome this issue, some authors included an additional term, weighted by further constants, in the evolution equation [13]. To improve the adjustment of the model to the image, a third term was also added to the previous expression [12-14], in which X is a contour obtained from the level-set function. This last term adds an additional attraction force when the front (the zero level-set) is in the neighbourhood of boundaries. With these terms, the model behaves better in synthetic images; but in real medical images the latter model is not robust enough to process medical shapes, and, thus, only the first two terms should be used [13].
2.1 The Stopping Function The first level-set model implementation used equation (1) with the stopping function described in equation (3). Some authors replace K with –K, adjusting the constant when necessary; this change influences the direction of evolution, and the level-set model mainly expands (or shrinks) depending on the sign of K [10, 11]. Common stopping functions are either an exponential or an inverse power function of the smoothed image gradient, where m is a positive integer in the latter case. A control factor of the stopping function is the image gradient. In equation (7), the relation between the stopping function and the gradient is inverse and depends on m. Some authors [15] use equation (6), while others [12, 16] use equation (7) with m=1 and/or m=2; nevertheless, Malladi [11] refers to both equation (6) and equation (7) with m=1.
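The two choices referred to as equations (6) and (7) are commonly written as follows (a sketch in the same assumed notation):

\[
P = \exp\bigl(-\lvert\nabla (G_\sigma * I)\rvert\bigr)
\qquad\text{or}\qquad
P = \frac{1}{1 + \lvert\nabla (G_\sigma * I)\rvert^{\,m}} .
\]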
2.2 Extensions Suri [13] describes several extensions for geometric deformable models, which include the derivation of geometric models from parametric deformable models, using internal and external energies/forces, as well as two coupled geometric models among other variations. Geometric models may need large computational resources with the
increase in level-set dimension. Several authors describe methods such as the Fast Marching Method or the Narrow Band Method to improve the model evolution speed [11, 13, 14, 16].
3 The Proposed Level-Set Approach Our main purpose is to segment pulmonary air bubbles from HRCT images. The classical level-set method is not able to overcome problems caused by noise and irregular pulmonary structures. To overcome these obstacles, we have developed a level-set approach with a new initialization procedure and a new stopping function using a region seeking criterion, which has produced promising results. Our approach starts by attenuating noise followed by the initialization of the levelset function. Then, the evolution of the level-set is controlled by its position and curvature, and also by the stopping function. Finally, during post-processing, regions smaller than a given threshold are rejected.
3.1 Pre-processing Noise is one of the main obstacles to any segmentation task. To attempt to reduce this problem, we start by smoothing the HRCT data with an average filter. Although noise is not totally removed, it is sufficiently attenuated to not significantly disturb the evolution of our level-set method. We define the initial level-set function which is represented as an array with the same size as the original HRCT data, where each array element is computed as a function of its distance d to the center of the HRCT data set:
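A definition consistent with the properties stated next (zero at the corners of the data set, equal to L at its centre, positive everywhere) is, as a sketch:

\[
\phi_0(\mathbf{x}) = L - d(\mathbf{x}) .
\]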
where L is the maximum distance (e.g., for a 2D image, L is equal to half its diagonal length, vertex elements have values equal to zero and center elements are equal to L). With this initial function, all level-set values are positive and will be allowed to decrease with a speed proportional to the stopping function and to the level-set divergence.
3.2 Evolution The P stopping function defined by equations (6) or (7) is used by some authors to process synthetic images or well-behaved medical images (i.e., most of their regions have almost uniform intensity). In our case, HRCT thoracic images, specially in pulmonary regions, comprise several non uniform intensity regions. We recall that lungs are one of the organs with larger CT window Hounsfield values [17]. To surpass this problem, we have defined a new P function:
where I is the image, and the mean intensity value and the dynamic range are computed in the region to be segmented; suitable values of both were chosen for air bubble segmentation. The logarithmic variation used in equation (9) has advantages when compared to a linear variation. For low values of the argument, P is small and its derivative is high, meaning that the level-set is near the region to be segmented and must reduce its evolution speed (low P values) to be able to detect it. For high values, P is large and has an almost null derivative, meaning that the level-set is far away from the region to be segmented; the evolution speed then remains high and almost constant, continuing the search for regions to be segmented. Using this stopping function, the level-set will tend to adjust itself to the low intensity regions to be segmented, and additional terms to impose convergence or fast evolution with a constant direction (always increasing or decreasing level-set values) are no longer needed. From (1) we obtain a discrete evolution equation for updating the level-set function at each time step.
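A forward-Euler discretisation of (1) of the kind implied here — a sketch in assumed notation, with Δt the time step and the sign of the constant c chosen so that, for this application, the level-set values decrease towards the regions of interest — is:

\[
\phi^{\,n+1} = \phi^{\,n} + \Delta t \; P\,\bigl(c + K^{\,n}\bigr)\,\lvert\nabla\phi^{\,n}\rvert .
\]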
This update proved to be robust and to converge quickly (using time steps up to 10 and even higher), as long as abnormally high values were clipped to reasonable values.
3.3 Post-processing Often, the level-set identifies not only the correct region, but also additional small regions, due to noise or irregular image texture. These unwanted small regions are discarded if their areas (or volumes, for 3D data) are lower than a threshold. By definition, bubbles have a diameter no less than 1cm; to segment them in HRCT thoracic data, we reject all regions with lower size.
4 Results
The first experiments using our level-set approach were performed on the fish cell images shown in figure 1a), using the stopping function of equation (9). We processed three different fish cell images. Convergence was achieved after 5 to 20 iterations, depending on the image being processed, with a computation time of less than a minute per image. All processing was done on a Pentium 4 computer at 1.6 GHz with 256 MB of RAM, using Matlab 6.5.
Fig. 1. Fish cells images: a) original image, b) and c) are the result after 9 and 12 iterations; d) final contours overlaid on the original image.
Since our main goal is to segment pulmonary air bubbles from HRCT images, we processed one 2D image with the method described, and successfully identified several air bubbles in less than one minute, for an image of 512×512 pixels.
Fig. 2. Thoracic HRCT image with large air bubbles: a) original image, b) through d) are the result after 2, 3 and 9 iterations; e) final contours overlaid on the original image: large contours correspond to pulmonary air bubbles, small contours correspond to false candidates.
Applying the described method on a volume data set from a 3D HRCT exam, we are able to segment any number of pulmonary air bubbles placed anywhere inside the thoracic region. In figure 3, several air bubbles were segmented with only 5 iterations.
Fig. 3. Image from 3D thoracic exam: a) through c) result after 1, 2 and 4 iterations; d) final image after post-processing.
4.1 Evaluation
To evaluate the performance of the proposed method, we used a real CT exam and inserted artificial air bubbles inside the lungs. Starting from one multi-slice CT acquisition, three exams were reconstructed from the same thoracic region, with different longitudinal resolutions (10 mm, 5 mm, 2.5 mm) and different numbers of slices (12, 17 and 33 slices, respectively, each slice with 512×512 pixels). In each CT exam, several artificial air bubbles were inserted, with radii between 10 mm and 30 mm.
The processing time is proportional to the number of slices and varied from 3 up to 12 minutes for exams with 10 and 33 slices, respectively.
Fig. 4. Relative volume variation, for three CT exams with air bubbles of several sizes.
From figure 4, we observe that the error is inversely proportional to the bubble volume, as expected, and to the number of slices: the larger the bubble, the higher the method's accuracy, since small bubbles have a relatively high longitudinal error, due to the large difference between the axial (XX, YY) and longitudinal (ZZ) resolutions.
5 Conclusions
We presented a new approach to segment pulmonary air bubbles. While maintaining the advantages of the traditional level-set method, such as the capability of topologic transformations and working with any number of dimensions, our level-set approach, which includes a new stopping function, allows fast convergence even for large time steps. Motivated by the fact that HRCT thoracic images have significant noise, we developed a stopping function using a region-seeking approach instead of the traditional edge-seeking approach. To overcome manual initialization, we defined and implemented an automatic initialization procedure that surrounds the complete image (2D or 3D) and does not depend on the objects to be segmented. Experiments with different kinds of images were presented, which demonstrate the ability to detect several objects, as well as the power to simultaneously detect the interior and exterior of a region, as shown on cell boundaries. This approach was successfully applied in the 2D segmentation of the cytoplasm of fish cells and also in the segmentation of pulmonary air bubbles in both 2D and 3D HRCT images. Although this level-set approach was developed to identify air bubbles in the lungs, where the pulmonary tissues have non-uniform textures due to the aerial and blood trees, we believe that this segmentation method has potential applications in other medical image analysis domains, particularly in 3D. The proposed method is
valid for any number of dimensions: although it was implemented in 2D and 3D, it can be applied in n-D, even in non-medical imaging situations. Future directions for this work include the quantitative evaluation of pulmonary air bubbles by expert radiologists, as well as airway segmentation on 3D CT exams with very thin and adjacent slices.
References 1. Brink, J., et al.: Helical CT: Principles and Technical Considerations. RadioGraphics (1994) 14:887 - 893. 2. Wang, G., P. C. Cheng, M. W. Vannier: Spiral CT: Current Status and Future Directions. Proc. SPIE (1997) 3149:203-212. 3. Morgan, M. D. L., C. W. Edwards, J. Morris, H. R. Mattews: Origin and behaviour of emphysematous bullae. Thorax (1989) 44:533-538. 4. Reid, L.: The pathology of emphysema. Lloyd Luke, London (1967). 5. Silva, J. S., A. Silva, B. S. Santos, J. Madeira: Detection and 3D representation of pulmonary air bubbles in HRCT volumes. SPIE Medical Imaging 2003: Physiology and Function: Methods, Systems, and Applications (2003) 5031:430-439. 6. Kass, M., A. Witkin, D. Terzopoulos: Snakes: Active Contour Models. International Journal of Computer Vision (1988) 1:321-331. 7. Blake, A., M. Isard: Active Contours: Springer Verlag London Limited (1998). 8. Montagnat, J., H. Delingette, N. Ayache: A Review of Deformable Surfaces: Topology, Geometry and Deformation. Image and Vision Computing (2001) 19:1023-1040. 9. McInerney, T., D. Terzopoulos: Deformable Models in Medical Image Analysis: A Survey. Medical Image Analysis (1996) 1:91-108. 10. Osher, S., J. A. Sethian: Fronts Propagation with Curvature Dependent Speed: Algorithms Based on Hamilton-Jacobi Formulations. Journal of Computational Physics (1988) 79:1249. 11. Malladi, R., J. A. Sethian, B. C. Vemuri: Shape Modeling with Front Propagation: A Level Set Approach. IEEE Transactions on Pattern Analysis and Machine Intelligence (1995) 17:158-175. 12. Caselles, V., R. Kimmel, G. Sapiro: Geodesic Active Contours. International Journal of Computer Vision (1997) 22:61-79. 13. Suri, J. S., K. Liu, S. Singh, S. N. Laxminarayan, X. Zeng, L. Reden: Shape Recovery Algorithms Using Level Sets in 2D/3D Medical Imagery: A State of the Art Review. IEEE Transactions on Information Technology in Biomedicine (2002) 6:8-28. 14. Kawata, Y., N. Niki, H. Ohmatsu, R. Kakinuma, K. Eguchi, R. Kaneko, N. Moriyama: Quantitative surface characterization of pulmonary nodules based on thin-section CT images. IEEE Transactions on Nuclear Science (1998) 45:1218-1222. 15. Kovacevic, D., S. Loncaric, E. Sorantin: Deformable Contour Based Method for Medical Image Segmentation. In: 21st International Conference on Information Technology Interfaces ITI’99 (1999). 16. Wang, H., B. Ghosh: Geometric Active Deformable Models in Shape Modeling. IEEE Transactions on Image Processing (2000) 9:302-308. 17. Hofer, M.: CT Teaching Manual. Georg Thieme Verlag, Stuttgart (Germany) (2000).
Robust Fitting of a Point Distribution Model of the Prostate Using Genetic Algorithms Fernando Arámbula Cosío CCADET, UNAM, Cd. Universitaria, A.P. 70-186 México, D.F., 04510.
[email protected]
Abstract. A Point Distribution Model (PDM) of the prostate has been constructed and used to automatically outline the contour of the gland in transurethral ultrasound images. We developed a new, two stage, method: first the PDM is fitted, using a multi-population genetic algorithm, to a binary image produced from Bayesian pixel classification. This contour is then used during the second stage to seed the initial population of a simple genetic algorithm, which adjusts the PDM to the prostate boundary on a grey level image. The method is able to find good approximations of the prostate boundary in a robust manner. The method and its results on 4 prostate images are reported.
1 Introduction
Automatic segmentation of the boundary of an organ, in ultrasound images, constitutes a challenging problem of computer vision. This is mainly due to the low signal to noise ratio typical of ultrasound images, and to the variety of shapes that the same organ can present in different patients. Besides the theoretical importance of the problem, there are potential practical gains from automatic segmentation of ultrasound images, since ultrasound is a portable, low cost, real time imaging modality. It is particularly suitable for intraoperative image guidance of different surgery procedures. In this work we report the automatic segmentation of the prostate boundary in transurethral ultrasound images. The final objective is to measure the prostate of a patient intraoperatively during a Transurethral Resection of the Prostate (TURP) for image guided surgery purposes. Transurethral images provide the same shape of the prostate during ultrasound scanning as well as during resection of the prostate, since the ultrasound probe is inserted through the same transurethral sheath as the resection instrument [1]. We could then reconstruct the 3D shape of the prostate accurately from a set of annotated transurethral images. Previous work on automatic segmentation of the boundary of the prostate in ultrasound images includes the following. Aarnik et al. [2] reported a scheme based on edge detection using second derivatives, and edge strength information obtained from the gradient at each edge location. Using the edge location and strength information, an edge intensity image is obtained. A complete boundary of the prostate in transrectal images is constructed
from the edge intensity image using rules and a priori knowledge of the prostate shape. The boundary construction algorithm used was not reported. A segmentation scheme based on a variation of a photographic technique has been reported by Liu et al. [3] for prostate edge detection. The scheme does not produce a complete prostate boundary; it produces partial edge information from which the transrectal prostate boundary would need to be constructed. Dinggang et al. [4] report a statistical shape model for segmentation of the prostate boundary in transrectal ultrasound images. A Gabor filter bank is used to characterize the prostate boundaries at multiple scales and multiple orientations. Rotation invariant Gabor features are used as image attributes to guide the segmentation. An energy function was developed with an external energy component made of the real and imaginary parts of the Gabor filtered images, and an internal energy component based on attribute vectors that capture the geometry of the prostate shape. The energy function is optimized using the greedy algorithm and a hierarchical multiresolution deformation strategy. Validation on 8 images is reported. A semi-automatic method is described by Pathak et al. [5]. Contrast enhancement and speckle reduction are performed using an edge sensitive algorithm called sticks. This is followed by anisotropic diffusion filtering and Canny edge detection. During image annotation a seed is placed inside the prostate by the user. False edges are discarded using rules, the remaining probable edges are overlaid on the original image, and the user outlines the contour of the prostate by hand. Another semi-automatic method is reported in Gong et al. [6]. Superellipses are used to model the boundary of the prostate in transrectal images. Fitting is performed through the optimization of a probabilistic function based on Bayes' theorem. The shape prior probability is modeled as a multivariate gaussian, and the pose prior as a uniform distribution. Edge strength is used as the likelihood. Manual initialization with more than two points on the prostate boundary is required from the user. We have previously reported a simple global optimization approach for prostate segmentation on transurethral ultrasound images, based on a statistical shape model and a genetic algorithm, which optimizes a grey level energy function. The method was able to find accurate boundaries on some prostate images; however, the energy function used showed minimum values outside of the prostate boundary for other images [1]. In this paper we report a two stage method for global optimization of a statistical shape model of the prostate. During the first stage, pixel classification is performed on the grey level image using a Bayes classifier. A point distribution model [7] of the prostate is then fitted to the binary image using a multipopulation genetic algorithm (MPGA); in this way a rough approximation of the prostate boundary is produced which takes into account the typical shape of the gland and the pixel distribution on the image. During the second stage of the process, the initial population of a simple genetic algorithm (SGA) is seeded with the approximate boundary previously found. The SGA adjusts the PDM of the prostate to the gaussian filtered grey level image. The following sections describe the method and its results.
2 Pixel Classification of Prostate Images
Bayes discriminant functions [8] (eq. 1) were used to classify prostate from background pixels, where the two quantities involved are the class conditional probability of class k, with k = {prostate, background}, and the a priori probability of class k. Two mixture of gaussians models (MGM) of the class conditional probability distributions of the prostate and the background pixels were constructed, using the expectation maximization algorithm [8]. Each pixel sample (x) is a three-component vector (x, y, g) of the pixel coordinates (x, y) and its corresponding grey value (g). The training set consisted of: Np = 403010 prostate pixels; and Nb = 433717 background pixels. From the training sample proportions we can estimate the prior probability of class k. In figure 1 are shown two (non-training) prostate images and the corresponding pixel classification results, where a pixel is set to 255 if the prostate discriminant is the larger one for that pixel; otherwise the pixel is zero.
Fig. 1. Results of pixel classification a) original images; b) Corresponding binary images
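A sketch of this two-class Bayesian pixel classifier using off-the-shelf Gaussian mixture fitting; the scikit-learn tooling and the number of mixture components per class are assumptions of this illustration, since neither is stated in this excerpt:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_class_model(samples, n_components=4):
    """samples: (N, 3) array of (x, y, g) pixel feature vectors for one class,
    modelled by a Gaussian mixture fitted with EM."""
    return GaussianMixture(n_components=n_components, covariance_type='full').fit(samples)

def classify_pixels(features, gmm_prostate, gmm_background, prior_prostate, prior_background):
    """Return 255 where the prostate discriminant (log-likelihood + log prior) wins, 0 otherwise."""
    log_post_p = gmm_prostate.score_samples(features) + np.log(prior_prostate)
    log_post_b = gmm_background.score_samples(features) + np.log(prior_background)
    return np.where(log_post_p > log_post_b, 255, 0)
```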
3 Prostate Model Optimization
A Point Distribution Model (PDM) [7] of the shape of the prostate in transurethral images was constructed with a training set of 50 prostate shapes. The pose and shape of the model can be adjusted with 4 pose and 10 shape parameters [1]. The model is first adjusted to the binary image produced by pixel classification, using a multipopulation genetic algorithm (MPGA), with the following parameters: Probability of crossover (Pc = 0.6); Probability of mutation (Pm = 0.001); Number of subpopulations (Nsub = 10); Number of individuals per subpopulation (Nind = 10); generation gap (GG = 0.9). The theory of genetic algorithms is presented in [9].
3.1 Model Fitting to the Binary Image
An energy function for model fitting was constructed based on pixel profiles, 61 pixels long, perpendicular to the prostate model and located at regular intervals along the model, as shown in Fig. 2. The energy function (eq. 2) is minimum for model instances continuously located around white regions and surrounded by the black background,
where n is the number of pixel profiles sampled, and is the value (0 or 255) of pixel i.
An MPGA proved able to find the global minimum of the energy function in a consistent manner, while the single population genetic algorithm (SGA) is more sensitive to local minima. Figure 3 shows the results of ten experiments using the MPGA and the SGA to adjust the PDM of the prostate to a binary image produced by pixel classification.
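For illustration, a generic genetic-algorithm loop for minimising such an energy over the 14 PDM parameters might look as follows; this is a single-population sketch, not the authors' MPGA implementation, and `energy` is a hypothetical callable that evaluates eq. 2 for a given parameter vector:

```python
import numpy as np

def run_ga(energy, n_params=14, pop_size=50, generations=100,
           p_cross=0.6, p_mut=0.001, bounds=3.0):
    """Minimal GA sketch: ranking selection, uniform crossover, gaussian mutation."""
    rng = np.random.default_rng(0)
    pop = rng.uniform(-bounds, bounds, size=(pop_size, n_params))
    for _ in range(generations):
        fitness = np.array([energy(ind) for ind in pop])   # lower energy is better
        parents = pop[np.argsort(fitness)[: pop_size // 2]]
        children = []
        while len(children) < pop_size:
            a, b = parents[rng.integers(len(parents), size=2)]
            child = np.where(rng.random(n_params) < 0.5, a, b) if rng.random() < p_cross else a.copy()
            mutate = rng.random(n_params) < p_mut
            child[mutate] += rng.normal(0.0, 0.1, size=int(mutate.sum()))
            children.append(child)
        pop = np.array(children)
    fitness = np.array([energy(ind) for ind in pop])
    return pop[np.argmin(fitness)]
```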
Fig. 2. Pixel profile sampling during prostate model fitting
Fig. 3. Results of ten experiments of boundary fitting to a binary image, using: a) MPGA; b) SGA.
3.2 Model Fitting to the Grey Level Image
The boundary obtained during model fitting to the binary image is then used to seed the initial population of an SGA (Pc = 0.6, Pm = 0.001, N = 50, GG = 0.85), which is used to adjust the PDM to a gaussian filtered grey level image of the prostate. A grey level energy function was constructed (as shown in eq. 3) based on short (21 pixels long) grey level profiles sampled as shown in Fig. 2.
where n is the number of grey level pixel profiles sampled, and is the grey level value of pixel i. This energy function is designed to produce minimum values when a boundary is placed around a dark (hypoechoic) region which is surrounded by a bright (hyperechoic) halo. In Fig. 1a it can be observed that the prostate appears on ultrasound images as a dark region surrounded by a bright halo; however, some prostates also show dark regions inside the gland (see Fig. 1a, bottom), which could produce minimum values of the energy in some cases. Pixel classification and boundary fitting to the binary image help to avoid dark regions inside the prostate, as shown in the next section.
4 Results
The method described was implemented using MATLAB (Mathworks Inc.). Fig. 4 shows the results obtained for 4 different ultrasound images, compared to the corresponding expert annotated images.
Fig. 4. Results of automatic boundary segmentation: a) expert annotated images; b) computer annotated images.
Fig. 4 (Cont.) . Results of automatic boundary segmentation: a) expert annotated images; b) computer annotated images.
In the images shown in Fig. 4a, the black circle in the middle corresponds to the position of the transurethral transducer. Around the transducer a dark (hypoechoic) region inside the prostate can be observed. These dark regions inside the prostate could produce minimum values of (eq. 3). However the rough approximation of the prostate contour produced by the MPGA on the binary image produced by pixel classification (section 3.1) helps to avoid these dark regions and helps the SGA to find the correct boundary in the grey level image, as shown in Fig. 4b.
5 Conclusions
A new method for segmentation of the boundary of the prostate on transurethral ultrasound images is being developed. The method is based on a PDM of the prostate
boundary, which can only deform into shapes typical of the prostate, in this way significantly reducing the search space during model fitting. A rough approximation of the prostate shape and pose, on a digital image, is produced through pixel classification and model fitting to the resulting binary image. An MPGA proved to be robust during optimization of the binary energy function. During the second stage of the method, the PDM is adjusted using an SGA (which performs faster than the MPGA) on a gaussian filtered grey level image. The initial population of the SGA is seeded with the rough boundary previously obtained; this biases the search for the prostate boundary to the neighborhood of the initial estimate. This, in turn, helps to avoid minimum values of the grey level energy function (eq. 3) that can produce gross errors in model fitting. Preliminary results showed that the method reported is able to find good approximations of the prostate boundary in different transurethral ultrasound images. It is a fully automatic scheme which does not require any user intervention. Our method constitutes a systematic approach to boundary segmentation in transurethral images, which show characteristic dark regions inside the prostate that can produce boundary segmentation errors. These dark regions are not characteristic of transrectal prostate images, on which most of the boundary segmentation work has been performed. Further research will include an extensive evaluation of the robustness of our method under different image conditions, and the development of a final boundary refinement stage based on edge detection.
References 1. Arambula Cosio F. and Davies B.L.: Automated prostate recognition: A key process of clinically effective robotic prostatectomy. Med. Biol. Eng. Comput. 37 (1999) 236-243. 2. Aarnik R.G., Pathak S.D., de la Rosette J. J. M. C. H., Debruyne F. M. J., Kim Y., Wijkstra H.: Edge detection in prostatic ultrasound images using integrated edge maps. Ultrasonics 36 (1998) 635-642. 3. Liu Y.J., Ng W.S., Teo M.Y., Lim H.C.: Computerised prostate boundary estimation of ultrasound images using radial bas-relief method. Med. Biol. Eng. Comput. 35 (1997) 445454. 4. Dinggang S., Yiqiang Z., Christos D.: Segmentation of Prostate Boundaries From Ultrasound Images Using Statistical Shape Model. IEEE Trans. Med. Imag. 22 No.4 (2003) 539-551 5. Pathak S.D., Chalana V., Haynor D.R., Kim Y.: Edge-guided boundary delineation in prostate ultrasound images. IEEE Trans. Med. Ima. 19 No.12 (2000) 1211-1219. 6. Gong L., Pathak S.D., Haynor D.R., Cho P.S., Kim Y.: Parametric shape modelling using deformable superellipses for prostate segmentation. IEEE Trans. Med. Imag. 23 No. 3 (2004) 340-349. 7. Cootes T.F., Taylor C.J., Cooper D.H., Graham J.: Active shape models -Their training and application. Comput. Vision Image Understanding. 61 (1995) 38-59. 8. Bishop C.M.: Neural networks for pattern recognition. Oxford University Press (1995). 9. Golberg D. E.: Genetic algorithms in search optimization and machine learning. AddisonWesley (1989).
A Quantification Tool to Analyse Stained Cell Cultures E. Glory1,2,3, A. Faure1, V. Meas-Yedid2, F. Cloppet1, Ch. Pinset3, G. Stamon1, and J-Ch. Olivo-Marin2 2
1 Laboratoire SIP-CRIP5,Université Paris 5, 75006 Paris, France Laboratoire d’Analyse d’Images Quantitative, Institut Pasteur, 75015 Paris, France 3 Celogos SA, 75015 Paris, France
Abstract. In order to assess the efficiency of culture media to grow cells or the toxicity of drugs, we elaborated a method of cell quantification based on image processing. A validated approach segments stained nuclei by thresholding the histogram of the best adapted color component. Next, we focus our attention on the classification methods able to distinguish isolated and aggregated nuclei, because the aggregation of nuclei reveals a particular cell function. Two decision trees have been designed to consider the different shape features of two types of nuclei, coming a) from bone marrow and b) from immature muscular cell cultures. The most relevant characteristics are the concavity, the circularity and the area of the binary objects.
1 Introduction
For growing cells in vitro, culture media have to contain components adapted to the nature of the cells (muscular, nervous, epithelial cells...) and to the stage of their development. The establishment of correct conditions requires adjusting the compositions with variable concentrations of components. A manual evaluation of medium performance is tedious, time-consuming and yields subjective results. This paper therefore proposes a methodology to acquire and analyse culture images in order to quantify the number of cells and to characterize the morphology of their nuclei. This method is suitable to control the capacity of cells to grow normally and to measure the toxicity of drugs. The most common techniques used by biologists to count cells are the biochemical protocols which quantify cell components, the Coulter® counter or the cytometer. Adapted to cells in suspension, these techniques do not provide morphological information about adherent cells. But this kind of configuration is important in our applications because a cluster of nuclei means a particular function of cells. Many papers have dealt with the segmentation of cells: techniques based on histograms, edge detection [3], multi-spectral imaging [5], skeletonization [4], region growing [8], snakes [1][15] and level sets [14] have been developed. As cells and nuclei are often segmented with medical diagnosis goals, neural networks, trained with object subsets assigned by experts, have been used to perform the classification [8]. The first part of our project is not so much concerned with segmentation accuracy
as with the correct quantification of nuclei. It is clear, though, that the counting performance depends on the quality of the segmentation. The paper is organized as follows: Section 2 presents the experimental environment of this study, Section 3 develops the algorithm used to segment nuclei, and the last section deals with the methods tested to classify individual and aggregated objects.
2 Experimental Environment
2.1 Biological Material and Protocol
Our biological experiments are principally carried out on bone marrow cells and muscular stem cells. Like the majority of mammalian cells, each cell has one nucleus. However, a particular phenomenon occurs when muscular cells differentiate: mononucleate muscular cells fuse to form syncytia containing a large number of nuclei. To avoid practical constraints, cell cultures are fixed and stained with Giemsa, a stable histological dye which specifically reveals nuclei. Thus, naturally translucent nuclei appear in magenta when observed in light microscopy. Cells grow in 12- or 96-well plates, and experiments are carried out in 3 to 8 wells to take into account the intrinsic variation of biological products (figure 1).
Fig. 1. Macroscopic view of 12 and 96 multiwell plates
2.2 Acquisition Step
The acquisition system is made of an inverted motorised microscope, a color CCD camera and software which controls plate stage motion, autofocus, acquisition and image saving. The parameters of magnification, color balance, luminosity, exposure time, filters and calibration are identical for all experiments. The acquisition procedure is fully automatic and independent of time because the cultures are fixed. The size of image is
3 Image Analysis
Although there is a standard staining protocol, nuclei coloration depends on the nature of cells (origin, stage of cellular cycle, stage of differentiation) and on their
environment (cellular concentration, culture medium...). We designed a fast and robust method to segment nuclei which is able to cope with the variability of the coloration. Our algorithm is developed with ImageJ [7], and figure 2 gives a detailed outline of the different steps involved.
Fig. 2. Outline of the image processing algorithm
3.1 Segmentation Method
Among the most common color spaces, the normalised green component, equal to G/(R + G + B), is chosen to reduce the color information to grey level data. The details of this choice are presented in [9]. This component is robust to the variations of illumination which occasionally occur during the acquisition. To reduce the computation time, the iterative algorithm proposed by Ridler in 1978 [11] is used to find the histogram threshold that separates cell nuclei from the background: T_{k+1} = (μ_b(T_k) + μ_f(T_k)) / 2,
where T_{k+1} is the threshold at iteration k+1, and μ_b(T_k) and μ_f(T_k) are respectively the grey level means of the background and the foreground classes delimited by T_k. Practically, iterations are stopped when the difference between successive thresholds becomes small enough. Sometimes, images contain no nucleus because of a low concentration of cells or an uneven spatial distribution. The partitioning of such images is overcome by learning the threshold on a subset of random images. Next, this value is applied to all images of a given well.
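A sketch of this thresholding step (normalised green component followed by Ridler's iterative selection), written in Python/NumPy:

```python
import numpy as np

def normalised_green(rgb):
    """Normalised green component g = G / (R + G + B), as used above."""
    rgb = rgb.astype(float)
    total = rgb.sum(axis=-1)
    return np.divide(rgb[..., 1], total, out=np.zeros_like(total), where=total > 0)

def ridler_threshold(image, eps=1e-3):
    """Iterative selection (isodata) threshold: the next threshold is the mean of the
    background and foreground means delimited by the current one."""
    t = float(image.mean())
    while True:
        bg = image[image <= t]
        fg = image[image > t]
        if bg.size == 0 or fg.size == 0:          # degenerate image, keep current estimate
            return t
        t_new = 0.5 * (bg.mean() + fg.mean())
        if abs(t_new - t) < eps:                  # stop when the change is small enough
            return t_new
        t = t_new
```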
3.2 Classification and Segmentation of Nuclei Aggregates
The above procedure usually results in binary images that contain noisy objects and artefacts. Then, a mathematical morphology opening is applied to eliminate isolated pixels and to smooth edges. The obtained connected components are either artefacts, nuclei or aggregates. These three classes are defined by using the histogram of the object areas. The major mode corresponds to the surface of isolated nuclei, considered as the reference area, because the most numerous objects are individual nuclei. Nevertheless, to avoid the major mode of the histogram being set by the numerous noisy objects, a fixed value depending on the magnification is established in order to eliminate too-small areas which have no biological interpretation. Practically, the histogram is built with bins of 20 pixels in order to balance the robustness of the result, the accuracy and the speed of the mode search. The three classes are determined as follows:
Small objects, with an area smaller than half of the reference area of individual nuclei, are considered artefacts and are eliminated. Large objects, with an area larger than twice the reference area, are considered aggregates and are processed separately. One can notice that isolated nuclei may have an area ratio varying from 1 to 8 in one image, thus large objects may also be big nuclei. Finally, the remaining objects are individual nuclei. Then, the watershed algorithm [13] is applied to the large objects category to split the aggregates. Using the splitting method only on the objects of interest speeds up the process and prevents individual nuclei from being over-segmented. The results of the watershed are very satisfactory when the shape of the aggregate can be described as ellipses joined side by side. Otherwise, aggregates are sub-segmented, over-segmented or wrongly segmented. The objects of interest are finally obtained by merging the results of the watershed and the isolated nuclei. Several features are computed to characterize them: area, perimeter, XY centroid coordinates, width and height of their rectangular bounding box, orientation of their principal inertia axis, and the lengths of the short and long axes of the best fitted ellipse. Data analysis computes information concerning conditions (relative to similar wells) from the image extraction, by calculating means and standard deviations of objects, according to their nesting (image, well, condition, experiment). We can decide to put some images or a complete well aside if they are blurred (autofocus fault), not biologically representative (experimental fault) or if the segmentation is too poor (image processing fault).
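A sketch of this classification and splitting step; the minimum-area floor `min_area` stands in for the magnification-dependent value mentioned above and is an assumed number, and the scikit-image/SciPy tooling is an assumption of this illustration:

```python
import numpy as np
from scipy import ndimage
from skimage.measure import label, regionprops
from skimage.segmentation import watershed

def classify_and_split(binary, min_area=30, bin_width=20):
    """Area-histogram classification into artefacts / nuclei / aggregates,
    followed by a distance-transform watershed on the aggregates only."""
    labels = label(binary)
    areas = np.array([r.area for r in regionprops(labels)])
    areas = areas[areas >= min_area]                         # drop objects with no biological meaning
    hist, edges = np.histogram(areas, bins=np.arange(0, areas.max() + bin_width, bin_width))
    ref_area = edges[np.argmax(hist)] + bin_width / 2.0      # major mode = reference nucleus area

    nuclei = np.zeros_like(binary, dtype=bool)
    aggregates = np.zeros_like(binary, dtype=bool)
    for r in regionprops(labels):
        region = labels == r.label
        if r.area < 0.5 * ref_area:
            continue                                         # artefact: eliminated
        elif r.area > 2.0 * ref_area:
            aggregates |= region                             # processed separately
        else:
            nuclei |= region                                 # individual nucleus

    dist = ndimage.distance_transform_edt(aggregates)
    split = watershed(-dist, mask=aggregates)                # split aggregates only
    return nuclei, split
```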
3.3 Results
The difference between the number of cells put at the beginning of the experiment and the counting of nuclei when cultures are fixed provides the doubling time
which measures the growth capacity of the medium. The Student’s t-test is used to compare the relevance of each medium to increase the growing property. This algorithm has already been used to study more than 12000 photos, representing about 300 culture conditions. To evaluate the counting performance of the proposed method, two experiments of 301 images are compared with the results of a manual operator. Details are shown in the table of figure 3.
Fig. 3. Comparison of the automatic counting with the manual counting. Subsegmentation and over-segmentation are mentioned to understand the differences.
Sub-segmentations are observed in three cases: i) nuclei are missed, due to a dark coloration closer to black than to magenta, ii) object surfaces are smaller than the adaptive artefact threshold, or iii) aggregated nuclei are wrongly split (wrongly classified as individual nuclei or not divided by the watershed). Over-segmentation occurs when the watershed splits an isolated nucleus into several parts. These two phenomena cancel each other out, therefore the difference between human and computer counting is about 1 to 5%. This evaluation is done on two experiments of non-confluent cultures (figure 4).
Fig. 4. Image of muscular cell nuclei stained with Giemsa and its segmentation. The algorithm finds all the nuclei (except those which are too small) without detecting dark artefacts. Drawbacks are cytoplasmic areas considered as nuclei and wrongly split aggregates
The variability of individual nuclear size shows the limits of choosing area as a feature to find aggregates. The next section deals with the classification methods we developed to improve this step.
4 Binary Object Classification
4.1 Data and Hypothesis
Aggregates appear when the stage of differentiation or the cellular concentration creates conditions for nuclei to cluster. Our goal is to find a way to obtain a subset of objects containing a majority of aggregated nuclei with as few individual nuclei as possible. To measure the quality of the classification, a manual partitioning has been made by an operator, using human knowledge, color information and shape perception. The shape features used as input are those reported in section 3.2. A combination of them allows computing the compactness, the ratio of the object area to the bounding box area, the ratio of their perimeters and the ratio of the axes lengths. As we noticed that human interpretation uses concavity information, we have defined some patterns of Freeman’s code to localise concave shapes from edges [6]. As objects are small (about 70 pixels in area for individual nuclei), concavity patterns are found for invaginations between two nuclei but also for noisy edge pixels. Even if concavity is a noise-sensitive criterion, the information of “no concavity” correctly discriminates individual nuclei in bone marrow cell cultures.
4.2 Method
The experiment is carried out on two types of cells: bone marrow cells (characterised by rather round nuclei) and non-differentiated muscular cells (more ellipsoidal nuclei). The proportions of aggregates are respectively 7.6% and 21.7% of the total objects, but this information cannot be used in a Bayesian approach [12] because these ratios change from one image to another, and from one experiment to another. After testing the principal component analysis method [6] and a basic neural network, which did not give satisfactory enough results, we have used the decision tree approach. No single feature is able to clearly distinguish the isolated and the aggregated objects. Nevertheless, the threshold values of the feature distributions are determined to discriminate, at best, the two kinds of objects.
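As an illustration of this kind of hand-built rule, a tiny decision function over area, compactness and concavity is sketched below; the numeric thresholds are placeholders, not the values learnt from the training data:

```python
import numpy as np

def compactness(area, perimeter):
    """Perimeter^2 / (4*pi*area): 1 for a perfect disc, larger for elongated or concave shapes."""
    return (perimeter ** 2) / (4.0 * np.pi * area)

def is_aggregate(area, perimeter, n_concavities, area_ref=70.0):
    """Tiny decision tree: very large objects are aggregates; objects without concavities
    are taken as isolated nuclei; the remainder is decided on compactness."""
    if area > 2.0 * area_ref:
        return True
    if n_concavities == 0:
        return False
    return compactness(area, perimeter) > 1.4
```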
4.3 Results
We have found that concavity and compactness are the most discriminant characteristics for the bone marrow aggregates. For the muscular aggregates, compactness and area are selected. Thanks to the decision trees, the class of interest recovers more than 95% of the aggregates contained in the initial input (figure 6). Compared to the method used in section 3.2, based only on the area feature, this approach gives better results (figure 6). The classification percentage seems better for muscle nuclei, but the number of aggregates found is greater (recall in fig. 6). This method has to be validated on other experiments before introducing it in the general algorithm at the classification step. Future studies will improve the splitting process of
Fig. 5. Decision trees of bone marrow and muscular objects. The root of the tree represents the number of objects to classify (633 respectively 842). Actually, it contains 48 (respectively 183) true aggregates determined by an expert. The surrounded leaves are the result of the classification. 55 (207) objects are counted for 46 (178) true aggregates.
the aggregates in order to accurately localize nuclei and describe their relative position. Indeed, the aggregation of nuclei is related to the function of cells and it would be interesting to quantify this phenomenon.
Fig. 6. Percentage of good classification. As the classification process could miss aggregates, the recall percentage is also computed.
5 Conclusion
We have described a computerised vision system developed to quantify stained cell cultures. To assess the growth capacity of a given medium, we measure the increase in the number of nuclei after a given time of culture. The segmentation of nuclei is achieved by thresholding the histogram of the normalised green component. The segmentation is refined by applying the watershed technique to the fraction of objects we consider to be nuclei aggregates. The number of nuclei counted with this algorithm is similar to the human count on non-confluent cultures. Moreover, we have designed a way to better distinguish the aggregated nuclei from the isolated ones by using decision trees depending on the nature of the cells. The most discriminant features are the concavity, the area and the circularity. The final subsets recover more than 95% of the actual aggregates and retain less than 15% of isolated nuclei. Future work will be devoted to splitting the aggregates of nuclei more precisely.
References l. Bamford, P.: Segmentation of cell images with an application to cervical cancer screening. PhD thesis - University of Queenland (1999) 2. Bitter, E., Benassarou, A., Lucas, L., Elias, L., Tchelidze, P., Ploton, D., O’Donohue, M.-F.: Analyse et Modélisation à l’Intérieur de Cellules Vivantes dans des Images 4D. R.F.I.A. 14ème Congrès Francophone AFRIF-AFIA, Toulouse, France (2004) 845–854 3. Brenner, J.J., Necheles, T.F., Bonacossa, I.A., Necheles, T.F., Fristensky, R., Weintraub, B.A., Neurath, P.W.: Scene segmentation techniques for the analysis of routine bone marrow smears from acute lymphoblastic leukaemia patients. Journal of Histochemistry and Cytochemistry 25-7 (1977) 601–613 4. Cloppet, F., Oliva, J.M., Stamon, G.: Angular Bisector Network, a Simplified Generalized Voronoï Diagram: Application to Processing Complex Intersections in Biomedical Images. IEEE Trans, on Pattern Analysis and Machine Intelligence 22-1 (2000) 120–128 5. Fernandez, G., Kunt, M., Zryd, J.-P.: Multi-spectral based cell segmentation and analysis. Proc. of the Workshop on Physics-based Modeling in Computer Vision, Cambridge, USA (1995) 166–172 6. Gonzalez, R.C., Wintz, P.: Digital Image Processing. Addison Wesley (1987) 122– 125, 392–394 7. Rasband, W.: ImageJ, Version 1.31s. National Institues of Health, USA (2004) http://rsb.info.nih.gov/ij/ 8. Lezoray, O., Elmoataz, A., Cardot, H., Revenu, M. : A.R.T.I.C : Un systéme automatique de tri cellulaire par Analyse d’images. Vision Interface (1999) 312–319 9. Meas-Yedid, V., Glory, E., Pinset, Ch., Stamon, G., Olivo-Marin, J-Ch.: Automatic color space selection for biological image segmentation. International Conference of Pattern Recognition (2004) to appear 10. Pal, N. R., Pal, S. K.: A review on image segmentation techniques. Pattern Recognition 26-9 (1993) 1277–1294 11. Ridler, T.W., CalvardClarke, S. : Picture thresholding using an iterative selection method. IEEE Trans. System, Man and Cybernetics 8 (1978) 630–632 12. Theodoridis, S., Koutroumbas, K.: Pattern Recognition. Accademic Press (1999) 13–20 13. Vincent, L., Soille, P.: Watersheds in digital spaces: an efficient algorithm based on immersion simulations. IEEE Trans. on Pattern Analysis and Machine Intelligence 13-6 (1991) 583–598 14. Xie, X., Mirmehdi, M.: Level-set based geometric color snakes with region support. Proc. of International Conference on Image Processing 2 (2003) 153–156 15. Zimmer, C., Labruyère, E., Meas-Yedid, V., Guillèn, N., Olivo-Marin, J.-Ch.: Improving active contours for segmentation and tracking of motile cells in videomicroscopy. 16th International Conference on Pattern Recognition, 2 (2002) 286–289
Dynamic Pedobarography Transitional Objects by Lagrange’s Equation with FEM, Modal Matching, and Optimization Techniques Raquel Ramos Pinho1 and João Manuel R.S. Tavares1,2 1
FEUP – Faculdade de Engenharia da Universidade do Porto LOME – Laboratório de Óptica e Mecânica Experimental 2 DEMEGI – Departamento de Engenharia Mecânica e Gestão Industrial Rua Dr. Roberto Frias, s/n, 4200-465 PORTO, PORTUGAL {rpinho, tavares}@fe.up.pt
Abstract. This paper presents a physics-based approach to obtain 2D or 3D dynamic pedobarography transitional objects from two given images (2D or 3D). With the used methodology, we match nodes of the input objects by using modal matching, improved with optimization techniques, and solve the Lagrangian dynamic equilibrium equation to obtain the intermediate shapes. The strain energy involved can also be analysed and used to quantify local or global deformations.
1 Introduction
Pedobarography is the measurement of dynamic variations in downward pressure by different areas of the foot sole, using a pedobarograph (an apparatus for recording dynamic variations as a person stands upright or walks). The recording of pedobarographic data during a normal walking step allows the dynamic analysis of the feet’s behaviour [1], [2]. For example, diabetic patients suffer from irrigation problems which may cause ulcerations [2]. So, it is of interest to determine the conditions that can increase these occurrences, through the analysis of the temporal evolution of the support surfaces, the detection of the plantar hyperpressure zones, and the analysis of the spatial and temporal gradients of zones with higher pressure. The technical solutions used nowadays to analyse sequences of pedobarographic images have some deficiencies and are rather subjective [2]. On the other hand, it might be necessary to determine some intermediate pedobarographic images of a given sequence. So, to estimate the transitional images, we use a physics-based approach: we solve the Lagrange’s equation (LE) between the models built by the Finite Element Method (FEM) [3], [4]. With this solution, as more nodes are matched (which can be improved by using modal matching with optimization techniques [5]),
the obtained intermediate images are consistent with reality, as can be verified with the experimental results to be presented. The idea of considering physical restraints in objects’ modelling has been suggested and used in computational vision by several authors, for example Terzopoulos, to simulate realistic deformation with an elastic model based on the Lagrange equation [6], [7]. It has been seen that when objects are represented according to physical principles, non-rigid movement can be adequately modelled. In this work, we have taken into account what other authors have done, namely: the restrictions that prevent inadequate matches according to the considered criteria [8], the modal matching process proposed by Shapiro [9], and the development of an isoparametric finite element that can be used to adequately model image represented objects [2], [10], [11]. In our previous work we have used the Lagrange’s equation to simulate the deformation between 2D represented objects’ outlines [4] (some obtained from medical images [3]), which is extended in this paper to the application domain of pedobarography, namely by considering 3D pedobarographic objects and 2D isopressure contours. The criteria that we used to estimate the applied charges also had to be reformulated, by adapting them to these pedobarography objects. In order to obtain more realistic transitional shapes, we applied a previously proposed approach to match the objects’ nodes based on optimization techniques [5] and then solve the Lagrange’s equation with a much higher number of successfully matched nodes. In previous works (see for example [4]) we had analysed the evolution of the global and local strain energy values as the deformation takes place, but in this paper we study the influence of each mode on that evolution.
2 Dynamic Pedobarography
To measure and visualise the distribution of pressure under the foot sole during a step, a system can be used with a glass or acrylic plate trans-illuminated through its polished borders in such a way that the light is internally reflected. This plate is covered on its top by a single or dual thin layer of soft porous plastic material where the pressure is applied (see fig. 1). In fig. 2 is represented a sample sequence image, and as can be seen the information obtained is very dense and rich [1], [2].
3 From Images to Objects’ Models
To obtain 2D/3D objects from the given images (like fig. 2), standard image processing and analysis techniques are used [2]. The 2D objects we considered are the contours given by the objects’ outlines, or else isocontours at fixed pressure values. On the other hand, the 3D objects we considered are built by modelling each input image as a pressure surface with a membrane that is flat in the zones with no pressure and that deforms proportionally to the applied pressure. So, the virtual 2D contours and 3D surfaces obtained can be analysed as if they were real physical objects [2].
Fig. 1. Basic pedobarography principle [1], [2]
Fig. 2. Image of a pedobarography sample sequence [1], [2]
To physically model each of the given objects we employed the Finite Element Method (FEM), namely by using the Sclaroff’s isoparametric element [11]. Using this finite element, the models built for 2D contours delimit a virtual object with elastic properties, and the obtained model is like an elastic membrane. When a 3D surface object is modelled by the same element, it is as if each feature point is covered by a blob of rubbery material [2], [11]. To build the Sclaroff’s isoparametric finite element model we used as nodes the objects’ data points and assemble the interpolation matrix H (which relates the distances between objects’ nodes) using Gaussian functions:
where is the function’s n-dimensional center, and controls data interaction. The interpolation functions are given by:
where are the interpolation coefficients, with value 1 (one) at node i, and 0 (zero) at all other nodes, and m is the number of nodes. The interpolation coefficients can be determined by inverting matrix G defined as:
This way, the interpolation matrix of Sclaroff’s isoparametric element, for a 2D shape will be:
The mass and stiffness matrices, M and K respectively, are then built in the usual way [2], [10], [11], and the damping matrix C used was considered as a linear combination of those two matrices. The process of physically modelling 3D objects is entirely analogous.
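A sketch of this construction: the Gaussian basis matrix G evaluated at the nodes, the interpolation coefficients obtained by inverting G, and a Rayleigh-type damping matrix C = alpha*M + beta*K (the coefficients alpha and beta are free parameters here, as the text only states that C is a linear combination of M and K):

```python
import numpy as np

def gaussian_basis(nodes, sigma):
    """nodes: (m, n) array of m node coordinates in n dimensions.
    Returns G with G[i, j] = exp(-|x_i - x_j|^2 / (2*sigma^2))."""
    diff = nodes[:, None, :] - nodes[None, :, :]
    d2 = np.sum(diff ** 2, axis=-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def interpolation_coefficients(nodes, sigma):
    """Coefficients such that each interpolation function is 1 at its node and 0 at the others."""
    return np.linalg.inv(gaussian_basis(nodes, sigma))

def rayleigh_damping(M, K, alpha=0.01, beta=0.01):
    """Damping matrix as a linear combination of the mass and stiffness matrices."""
    return alpha * M + beta * K
```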
4 Matching Objects
To match the initial and target 2D models’ nodes, each generalized eigenvalue/vector problem is solved:
where is the matrix of the shape vectors (which describes the displacement (u,v) of each node due to vibration mode i) and is the diagonal matrix whose entries are the squared vibration frequencies (the eigenvalues), ordered increasingly. After building each modal matrix, and by comparing the displacement of each node in the modal eigenspace, some nodes can be matched and the affinity matrix Z is built:
In this matrix, the affinity between nodes i and j will be 0 (zero) if the match is perfect, and will increase as the match worsens. In this work, two search methods are considered to find the best match: (1) a local method or (2) a global one. The local method was proposed in [9], [10], [11], and previously used with pedobarography data in [1], [2], [12]; basically, it consists in searching each row of the affinity matrix for its lowest value, adding the associated correspondence to the matching solution if that value is also the lowest of the associated column. This method has the principal disadvantage of disregarding the object structure as it searches for the best match for each node. On the other hand, the global method consists in describing the matching problem as an assignment problem, and solving this problem using an appropriate optimization algorithm [5]. With the global search matching approach, cases in which the numbers of objects’ points to match are different can also be considered: initially the global search algorithm adds fictitious points to the model with fewer elements; then the points that are matched with the fictitious elements are adequately matched by using a neighbourhood and an affinity criterion. This way, matches of type “one to many” or vice versa are allowed for the model’s excess points [5]. Once again, the process of matching 3D objects is entirely analogous [2], [5], [11].
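A sketch of this modal-matching machinery: solve the generalized eigenproblem, keep a truncated modal description per degree of freedom, build the affinity matrix, and run the local row/column-minimum search described above (the global assignment-based search would replace the last step):

```python
import numpy as np
from scipy.linalg import eigh

def modal_descriptions(K, M, n_modes):
    """Solve K*phi = lambda*M*phi; eigenvalues come out in increasing order.
    Each row of the returned matrix is the truncated modal description of one degree of freedom."""
    _, evecs = eigh(K, M)
    return evecs[:, :n_modes]

def affinity_matrix(desc1, desc2):
    """Z[i, j] = distance between the modal descriptions of node i and node j (0 = perfect match)."""
    diff = desc1[:, None, :] - desc2[None, :, :]
    return np.linalg.norm(diff, axis=-1)

def local_matches(Z):
    """Accept (i, j) when Z[i, j] is the minimum of both its row and its column."""
    matches = []
    for i in range(Z.shape[0]):
        j = int(np.argmin(Z[i]))
        if int(np.argmin(Z[:, j])) == i:
            matches.append((i, j))
    return matches
```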
5 Resolution of Lagrange’s Equation
In this work, to obtain the transitional objects’ shapes according to physical properties, we solve the Lagrange equilibrium equation using the Mode Superposition Method [4], [13], [14]: M d²U/dt² + C dU/dt + K U = R,
where U are the nodal displacements and R are the implicit applied charges. The Mode Superposition Method requires the initial displacement and velocity vectors to solve the Lagrange’s Equation. The solution found to estimate the first one is to consider it as a part of the expected modal displacement. On the other hand, the initial modal velocity was estimated as a part of the initial modal displacement. For the implicit applied charges on each node, we considered them as proportional to the expected displacement.
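A sketch of a mode-superposition integration of this equilibrium equation, assuming Rayleigh damping so that the modal equations decouple; the initial conditions and the load vector R would be chosen as just described, and the simple explicit per-mode scheme is an assumption of this illustration:

```python
import numpy as np
from scipy.linalg import eigh

def mode_superposition(M, K, R, u0, v0, alpha=0.01, beta=0.01, dt=0.05, steps=100):
    """Project onto the M-orthonormal mode shapes, integrate each decoupled modal
    equation with a semi-implicit Euler step, and map back to nodal displacements."""
    evals, Phi = eigh(K, M)          # Phi.T @ M @ Phi = I, modal stiffnesses = evals
    r = Phi.T @ R                    # modal loads
    x = Phi.T @ M @ u0               # modal initial displacements
    v = Phi.T @ M @ v0               # modal initial velocities
    c = alpha + beta * evals         # modal damping from C = alpha*M + beta*K
    trajectory = []
    for _ in range(steps):
        a = r - c * v - evals * x    # decoupled modal accelerations
        v = v + dt * a
        x = x + dt * v
        trajectory.append(Phi @ x)   # intermediate nodal displacements
    return np.array(trajectory)
```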
6 Experimental Results
The approach used was included in a software platform previously built to develop and test image processing and computer graphics algorithms (for a detailed presentation see [15]). In this section, we present some of the results obtained when the approach previously described is applied to real dynamic pedobarography images. For the first example, consider contours 1 and 2 (represented in fig. 3), obtained from two distinct images [2]. If the modal matching is done on a local search basis, we obtain 35 matched nodes (fig. 4), while when optimization techniques are used, all of the 75 nodes of the initial contour are successfully matched (fig. 5). Fig. 6 shows the seven intermediate contours obtained with the presented approach to estimate the intermediate shapes. For the second example, consider surfaces 1 and 2 in fig. 7, obtained from two pedobarography images [2], with 103 and 117 nodes respectively. When modal matching is done without optimization techniques, only 43 nodes are matched (fig. 7); however, if these techniques are considered, then all nodes are successfully matched (fig. 8). Fig. 9 shows the fifth and the last (eighth) intermediate shapes obtained by our physically-based approach. When the strain energy involved is analysed [2], we notice that its values are proportional to the displacement in each step; that is, as the simulation starts to evolve the displacements tend to increase, but as the target shape is approached the displacements/strain energy decrease [4]. This can be seen in fig. 10, where the strain energy involved in all (8) estimated steps of the deformation between surfaces 1 and 2 is distributed over twelve groups of modes (the number of groups of modes considered is just for representation and analysis purposes). We noticed that when the deformation between objects is almost global, the strain energy is more concentrated in the first groups of modes (which happens in this example); if local deformation is significant instead, then higher modes are more influential in the strain energy values [2].
Fig. 3. Contours 1 and 2. Fig. 4. Matching between contours 1 and 2 with local search. Fig. 5. Matching between contours 1 and 2 with global search. Fig. 6. Estimated intermediate contours.
Fig. 7. Matching obtained between surfaces 1 and 2 with local search. Fig. 8. Matching obtained between surfaces 1 and 2 with global search. Fig. 9. 5th and 8th intermediate shapes obtained from surfaces 1 and 2.
For the third and last example, consider two isocontours extracted from surfaces 1 and 2 of the previous example (fig. 11). If modal matching is used with the usual local search, only 25 of the 68 nodes are matched, but if optimization techniques are considered, then 54 nodes can be matched (fig. 12). Fig. 13 shows two of the thirty intermediate shapes that estimate the evolution of the foot’s pressure surface at a given value.
7 Conclusions
In this paper, we have presented an approach that can be used to estimate transitional objects of a given sequence of pedobarographic images: We started by extracting 2D/3D objects from the input images with standard image processing techniques. Then, the obtained objects were physically modelled by using the Sclaroff’s isoparametric finite element. After that, correspondences between the models’ nodes
were established by modal analysis improved with optimization techniques. Finally, we solve the Lagrange’s equation to obtain the intermediate shapes according to physical principles.
Fig. 10. Strain Energy involved in the deformation between surfaces 1 and 2 during the estimated steps distributed by 12 groups of modes ordered increasingly (I-XII)
Fig. 11. Isocontours obtained from surfaces 1 and 2, respectively. Fig. 12. Matching between the isocontours of Fig. 11, with local and global search, respectively. Fig. 13. Intermediate shapes obtained from the isocontours of Fig. 11.
The experimental results obtained, some of which are presented in this paper, confirm that satisfactory matching results can be obtained for dynamic pedobarographic image data. It is also verified that the global search matching strategies used improve these results considerably. Another advantage of the improved search strategy based on optimization techniques is that the global matching methodology becomes easier to use and more adaptable to experimental applications [5]. This improved algorithm also matches satisfactorily all the models’ excess nodes, which can prevent the loss of information along the image sequence. During the experiments, we have also noticed that, as the correspondence between nodes gets better, the estimated intermediate shapes are more trustworthy and
realistic, and so this physics-based approach estimates the objects’ behaviour as would be physically expected. When analysing the strain energy values along the estimated deformation, we are able to confirm that the global deformation’s strain energy is concentrated in low order modes, which are related to rigid deformations, and that the local deformation’s strain energy is due to the influence of the higher modes, responsible for local deformations (such as noise). Acknowledgments. The first author would like to thank the support of the PhD grant SFRH / BD / 12834 / 2003 of the FCT - Fundação de Ciência e Tecnologia in Portugal.
References
[1] J. Tavares, J. Barbosa, A. J. Padilha: Matching Image Objects in Dynamic Pedobarography, 11th Portuguese Conference on Pattern Recognition, Porto, Portugal (2000)
[2] J. Tavares: Análise de Movimento de Corpos Deformáveis usando Visão Computacional, Tese de Doutoramento, FEUP (2000)
[3] R. Pinho, J. Tavares: Resolution of the Dynamic Equilibrium Equation to Simulate Objects’ Movement/Deformation, 7th Portuguese Conference on Biomedical Engineering, Lisboa, Portugal (2003)
[4] R. Pinho, J. Tavares: Morphing of Image Represented Objects by a Physical Methodology, 19th ACM Symposium on Applied Computing, Nicosia, Cyprus (2004)
[5] L. Bastos, J. Tavares: Matching of Objects Nodal Points Improvement using Optimization, Inverse Problems, Design and Optimization Symposium, Rio de Janeiro, Brasil (2004)
[6] D. Terzopoulos, A. Witkin, M. Kass: Constraints on Deformable Models: Recovering 3D Shape and Nonrigid Motion, Artificial Intelligence Journal, vol. 36, pp. 91-123 (1988)
[7] D. Terzopoulos, D. Metaxas: Dynamic 3D Models with Local and Global Deformations: Deformable Superquadrics, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 13, no. 7, pp. 703-714 (1991)
[8] J. L. Maciel, J. P. Costeira: A Global Solution to Sparse Correspondence Problems, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 25, no. 2, pp. 187-199 (2003)
[9] L. S. Shapiro, J. M. Brady: Feature-Based Correspondence: An Eigenvector Approach, Image and Vision Computing, vol. 10, pp. 283-288 (1992)
[10] S. Sclaroff, A. Pentland: Modal Matching for Correspondence and Recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 17, no. 6, pp. 545-561 (1995)
[11] S. Sclaroff: Modal Matching: A Method for Describing, Comparing, and Manipulating Digital Signals, PhD Thesis, MIT (1995)
[12] J. Tavares, J. Barbosa, A. Padilha: Determinação da Correspondência entre Objectos utilizando Modelação Física, 9° Encontro Português de Computação Gráfica, Marinha Grande, Portugal (2000)
[13] R. Cook, D. Malkus, M. Plesha: Concepts and Applications of Finite Element Analysis, John Wiley & Sons Inc (1989)
[14] K. Bathe: Finite Element Procedures, Prentice Hall (1996)
[15] J. Tavares, J. Barbosa, A. Padilha: Apresentação De Um Banco De Desenvolvimento E Ensaio Para Objectos Deformáveis, RESI - Revista Electrónica de Sistemas de Informação, vol. 1, no. 1 (2002)
3D Meshes Registration: Application to Statistical Skull Model

M. Berar 1, M. Desvignes 1, G. Bailly 2, and Y. Payan 3

1 Laboratoire des Images et des Signaux (LIS), 961 rue de la Houille Blanche, BP 46, 38402 St. Martin d'Hères cedex, France
{Berar, Desvignes}@lis.inpg.fr
2 Institut de la Communication Parlée (ICP), UMR CNRS 5009, INPG/U3, 46, av. Félix Viallet, 38031 Grenoble, France
[email protected]
3 Techniques de l'Imagerie, de la Modélisation et de la Cognition (TIMC), Faculté de Médecine, 38706 La Tronche, France
[email protected]
Abstract. In the context of computer assisted surgical techniques, a new elastic registration method for 3D meshes is presented. In our applications, one mesh is a high density mesh (30000 vertices) and the second is a low density one (1000 vertices). Registration is based upon the minimisation of a symmetric distance between both meshes, defined on the vertices, in a multi resolution approach. Results on synthetic images are first presented. Then, thanks to this registration method, a statistical model of the skull is built from Computed Tomography exams collected for twelve patients.
1 Introduction

Medical imaging and computer assisted surgical techniques may improve the current maxillo-facial surgical protocol as an aid in diagnosis, planning and the surgical procedure [1]. The steps of a complete assisted protocol may be summarized as: (1) morphological data acquisition, including 3D imaging computed from a Computed Tomography (CT) scanner; (2) data integration, which requires a 3D cephalometric analysis; (3) surgical planning; (4) surgical simulation for bone osteotomy and prediction of facial soft tissue deformation; (5) per-operative assistance for respecting the surgical planning. Three-dimensional cephalometric analyses, which are essential for the clinical use of computer aided techniques in maxillofacial surgery, are currently in development [2,3,4]. In most methods, the main drawback is the manual location of the points used to build the maxillofacial framework. The relationship between the cephalometry and the whole scan data is flawed by the amount of data and the variability of the exams. A common hypothesis is a virtual link between a low dimension model of the skull and these points. We choose to first construct a statistical model of the skull, which will be linked to a model of cephalometric points. This paper first presents data acquisition. In a second part, registration is described. Then, results on synthetic images are discussed and the construction of a statistical skull model is presented.
2 Method

The literature on registration methods is very extensive (see e.g. [5] for a survey). On one side are geometry-based registration methods, which use a few selected points or features, of which Iterative Closest Point and Active Shape Models are two classical approaches [6]. The main drawback of most of these methods is the need for the manual location, in advance, of the landmarks used to drive the correspondence between objects. On the other side are intensity-based algorithms, which use most of the intensity information in both data sets [7].
2.1 Data Acquisition and 3D Reconstruction of the Patient's Skull

Coronal CT slices were collected for the partial skulls of 12 patients (helical scan with a 1-mm pitch and slices reconstructed every 0.31 mm or 0.48 mm). The Marching Cubes algorithm was used to reconstruct the skull from the CT slices as isosurfaces. The mandible and the skull are separated before the beginning of the matching process, since our patients have different relative mandible positions (Fig. 1, left panel). In order to construct the statistical skull model, we need to register all the high density / low density meshes in a patient-shared reference system [8]. In this system, the triangles for a region of the skull are the same for all patients; the variability of the position of the vertices will represent the specificity of each patient's mesh. The vertices of this shared mesh can be considered as semi-landmarks, i.e. as points that do not have names but that correspond across all the cases of a data set under a reasonable model of deformation from their common mean [9,10]. This shared mesh was not obtained with a decimation algorithm. Because our goal is to predict anatomical landmarks (some of the cephalometric points) from the statistical skull model, we chose not to use a landmark-based deformation [as in 11] but a method that does not require the specification of corresponding features. The low definition model (Fig. 1, right panel) was therefore taken from the Visible Woman Project.
Fig. 1. High definition mesh (left), low definition mesh (right).
2.2 Shaping a Generic Model to Patient-Specific Data: 3D Meshes Registration

The deformation of a high definition 3D surface towards a low definition 3D surface is obtained by an original 3D-to-3D matching algorithm.
Fig. 2. Applying a trilinear transformation to a cube
2.2.1 3D to 3D Matching
The basic principle of the 3D-to-3D matching procedure developed by Lavallée and colleagues [12] consists in deforming the initial 3D space by a series of trilinear transformations applied to elementary cubes (see also Fig. 2).
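As an illustration of such a transformation (a sketch under our own naming conventions, not the authors' formulation), a trilinear transformation maps a point inside an elementary cube from its normalized coordinates and the positions assigned to the eight cube corners:

import numpy as np

def trilinear_transform(p, corners):
    """Map point p (normalized cube coordinates, each in [0, 1]) using the
    transformed positions of the 8 corners of an elementary cube.
    corners: array of shape (2, 2, 2, 3); corners[i, j, k] is the transformed
    position of the corner with normalized coordinates (i, j, k)."""
    u, v, w = p
    out = np.zeros(3)
    for i in (0, 1):
        for j in (0, 1):
            for k in (0, 1):
                # weight of each corner is the product of the 1-D hat weights
                weight = ((u if i else 1.0 - u) *
                          (v if j else 1.0 - v) *
                          (w if k else 1.0 - w))
                out += weight * corners[i, j, k]
    return out

# Example: an undeformed cube leaves the point unchanged
corners = np.array([[[[0, 0, 0], [0, 0, 1]], [[0, 1, 0], [0, 1, 1]]],
                    [[[1, 0, 0], [1, 0, 1]], [[1, 1, 0], [1, 1, 1]]]], dtype=float)
print(trilinear_transform((0.25, 0.5, 0.75), corners))   # -> [0.25 0.5  0.75]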
The elementary cubes are determined by iteratively subdividing the input space in a multi resolution scheme (see Fig. 3), in order to minimize the distance between the 3D surfaces:
where S is the surface to be adjusted to the set of points q, and p are the parameters of the transformation T (an initial rototranslation of the reference coordinate system and then a set of trilinear transformations). P(p) is a regularization function that guarantees the continuity of the transformations at the limits of each subdivision of the 3D space and that authorizes larger deformations for smaller subdivisions. The minimization is performed using the Levenberg-Marquardt algorithm [13].
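A minimal sketch of the kind of least-squares problem described above, restricted to the initial rototranslation and assuming the point-to-surface distance is approximated by the distance to the nearest surface vertex (function and parameter names are illustrative, not the authors' implementation):

import numpy as np
from scipy.optimize import least_squares
from scipy.spatial import cKDTree

def fit_initial_rototranslation(surface_vertices, points):
    """Estimate a rigid transform T(p) so that the transformed points q lie
    close to the surface S, by minimizing the sum of squared point-to-surface
    distances (the surface is approximated here by its vertex set)."""
    tree = cKDTree(surface_vertices)

    def residuals(p):
        rx, ry, rz, tx, ty, tz = p
        cx, sx = np.cos(rx), np.sin(rx)
        cy, sy = np.cos(ry), np.sin(ry)
        cz, sz = np.cos(rz), np.sin(rz)
        Rx = np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]])
        Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
        Rz = np.array([[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]])
        moved = points @ (Rz @ Ry @ Rx).T + np.array([tx, ty, tz])
        dists, _ = tree.query(moved)      # distance of each moved point to S
        return dists

    # Levenberg-Marquardt minimization, as in the registration method above
    result = least_squares(residuals, x0=np.zeros(6), method='lm')
    return result.x

In the actual method, the rigid step is followed by the trilinear deformations of the subdivided cubes and the regularization term P(p), which are omitted from this sketch.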
Fig. 3. Subdivision of an elementary volume of the original space and new transformation vectors (2D simplification) (left). Subdividing the space and applying the transformation (right).
Fig. 4. Matching a cone (source) toward a sphere (target) (left). Mismatched cone using the single distance method (centre); matched cone using the symmetric distance method (right).
2.2.2 Symmetric Distances

In some cases, the transformed surface is well matched to the closest surface but the correspondence between the two surfaces is false (see Fig. 4). This mismatching can be explained by the two distances between the two surfaces, which are not equivalent due to the difference of density between the two meshes. In this case, the distance from the source to the target (expressed in the minimization function) is very low, whereas the distance from the target to the source is large (see Table 1). We therefore included the two distances in the minimization function, as in [14]:
To compute the distance between the target and the source, the closest points of the low density vertices towards the high density mesh (the points in equation 2) are stored. The barycentre of this set of points is used in the distance between the high density mesh (target) and the low density mesh (source).
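A sketch of a symmetric (two-way) distance term between the vertex sets of the two meshes; for simplicity both directions are approximated here by nearest-vertex distances, which is only an approximation of the point-to-surface and barycentre-based distances used above:

import numpy as np
from scipy.spatial import cKDTree

def symmetric_distance(source_vertices, target_vertices):
    """Mean distance source->target plus mean distance target->source.
    With meshes of very different densities the two one-way terms can differ
    greatly, which is exactly the mismatch situation described in the text."""
    d_st, _ = cKDTree(target_vertices).query(source_vertices)
    d_ts, _ = cKDTree(source_vertices).query(target_vertices)
    return d_st.mean() + d_ts.mean()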
3 Results

3.1 Synthetic Images

We first tried these two methods on a set of four forms obtained with the same procedure. Each form is generated with two levels of density (low and high), before or after decimation. The following table shows the benefits of the "symmetric distance" method for these 8 objects, compared to the "single distance" method.
Table 2 summarises the results: the method is well suited for shapes of the same topology, but different topologies are not registered correctly: a sphere deformed to the open ring shape will not capture the aperture of the ring, and a cone will "flatten" itself in the centre of the ring.
3.2 Real Data: Mandible Meshes

The low density mandible meshes are generated using the "symmetric distance" method. The single distance approach leads to many mismatches in the condyle and goniac angle regions (Fig. 5). The maximal distances are located on the teeth (which will not be included in the model, but are used for correspondences during the registration) and in the coronoid regions. The mean distances can be considered as the registration noise, due to the difference of density (see Table 3).
3.3 Application: Skull Statistical Model

CT scans of 12 patients with different pathologies are used. Half of them suffer from sinus pathologies, while the other half suffer from pathologies of the orbits. The CT scans are centred around the pathology and do not include (except for one patient) the
skull vault. The patients have different mandible positions, so the skull and the mandible were registered separately.
Fig. 5. Mismatched parts of the mandible using the single distance method (left: condyle, center: goniac angle) and the low density mesh matched to the high density mesh using the symmetric distance method.
After joining these two parts of our model, they are aligned using Procrustes registration on the mean individual, as the statistical shape model must be independent from the rigid transformations (translation, rotation). Gravity centres are first aligned; then the optimal rotation that minimizes the distance between the two sets of points is obtained. The statistical model can only have 12 degrees of freedom (DOF), for a set of 3938 points (potentially 11814 geometrical DOF), as the number of DOF is limited by the number of patients. Using a simple statistical analysis, we show that 95% of the variance of the data can be explained with only 5 parameters (see Table 4). These "shape" parameters are linear and additive:
$S = M + \sum_i \alpha_i A_i ,$

where $M$ is the mean shape, $A_i$ the "shape" vectors, and $\alpha_i$ the shape coefficients.
Figure 6 shows the effects of the first two parameters. The first parameter is linked to a global size factor, whereas the second influences the shapes of the forehead and of the cranial vault.
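A minimal sketch of how such a linear, additive shape model can be built from the registered and aligned vertex sets (assuming the skulls are stored as arrays of corresponding vertices; variable names are illustrative):

import numpy as np

def build_shape_model(shapes, n_modes=5):
    """shapes: array (n_patients, n_points, 3) of registered, Procrustes-aligned
    vertex coordinates.  Returns the mean shape, the first shape vectors A_i and
    the variance carried by each mode."""
    n, p, _ = shapes.shape
    X = shapes.reshape(n, -1)              # one row of 3*p coordinates per patient
    mean = X.mean(axis=0)
    # PCA of the centred data; the number of usable modes is limited by the
    # number of patients, as noted in the text
    U, s, Vt = np.linalg.svd(X - mean, full_matrices=False)
    modes = Vt[:n_modes]                   # "shape" vectors A_i
    variance = s**2 / (n - 1)
    return mean.reshape(p, 3), modes.reshape(n_modes, p, 3), variance

def synthesize(mean, modes, coeffs):
    """New skull instance: mean shape plus a linear combination of the modes."""
    return mean + np.tensordot(coeffs, modes, axes=1)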
4 Conclusion

In this paper, a new registration approach for 3D meshes has been presented. In our application, one mesh is a high density mesh and the second a low density one. To enhance the registration, a symmetric distance has been proposed in a multi resolution
approach. Results on synthetic and real images exhibit good qualitative performances. This method is then used to elaborate a statistical skull model.
Fig. 6. Effects of the first (left) and second (right) parameters for 3 times the standard deviations.
References
1. Chabanas M., Marecaux Ch., Payan Y., Boutault F.: Models for Planning and Simulation in Computer Assisted Orthognatic Surgery, 5th Int. Conf. MICCAI'02, Springer, LNCS vol. 2489 (2002) 315-322
2. Marécaux C., Sidjilani B-M., Chabanas M., Chouly F., Payan Y., Boutault F.: A new 3D cephalometric analysis for planning in computer aided orthognatic surgery. First International Symposium on Computer Aided Surgery around the Head, CAS-H, Interlaken (2003) 61. [Abstract to appear in Journal of Computer Aided Surgery]
3. Olszewski R., Nicolas V., Macq B., Reychler H.: ACRO 4D: universal analysis for four-dimensional diagnosis, 3D planning and simulation in orthognatic surgery. In: Lemke H.U., Vannier M.W., Inamura K., Farman A.G., Doi K., Reiber J.H.C. (eds.): CARS'03, Edinburgh, UK (2003) 1235-1240
4. Frost S. R., Marcus L. F., Bookstein F. L., et al.: Cranial Allometry, Phylogeography, and Systematics of Large-Bodied Papionins (Primates: Cercopithecinae) Inferred From Geometric Morphometric Analysis of Landmark Data (2003) 1048-1072
5. Maintz J. B. A., Viergever M. A.: A survey of medical image registration. Medical Image Analysis, vol. 2, no. 1 (1998) 1-37
6. Hutton T. J., Buxton B. F., Hammond P.: Automated Registration of 3D Faces using Dense Surface Models. In: Harvey R., Bangham J.A. (eds.): British Machine Vision Conference, Norwich (2003) 439-448
7. Yao J., Taylor R.: Assessing Accuracy Factors in Deformable 2D/3D Medical Image Registration Using a Statistical Pelvis Model. IEEE International Conference on Computer Vision (2003)
8. Cootes T.F., Taylor C.J., Cooper D.H., Graham J.: Training models of shape from sets of examples, British Machine Vision Conference (1992)
9. Bookstein F. L.: Landmark methods for forms without landmarks: Morphometrics of group differences in outline shape, Med. Image Anal., vol. 1, no. 3 (1997) 225-243
10. Rønsholt Andresen P., Bookstein F. L., Conradsen K., Ersbøll K., Marsh J. L., Kreiborg S.: Surface-Bounded Growth Modeling Applied to Human Mandibles. IEEE Transactions on Medical Imaging, vol. 19, no. 11 (2000) 1053-1063
11. Kähler K., Haber J., Seidel H. P.: Reanimating the Dead: Reconstruction of Expressive Faces from Skull Data. ACM TOG (SIGGRAPH) 22(3) (2003) 554-561
12. Couteau B., Payan Y., Lavallée S.: The Mesh-Matching algorithm: an automatic 3D mesh generator for finite element structures. Journal of Biomechanics, 33(8) (2000) 1005-1009
13. Press W.H., Flannery B.P., Teukolsky S.A., Vetterling W.T.: Numerical Recipes in C: The Art of Scientific Computing, Cambridge University Press, Cambridge, England (1992)
14. Moshfeghi M.: Elastic Matching of Multimodality Images, Graphical Models and Image Processing, vol. 53, no. 3 (1991) 271-282
Detection of Rib Borders on X-ray Chest Radiographs

Rui Moreira 1, Ana Maria Mendonça 1,2, and Aurélio Campilho 1,2

1 INEB – Instituto de Engenharia Biomédica, Laboratório de Sinal e Imagem Biomédica
2 Universidade do Porto, Faculdade de Engenharia, Dep. Eng. Electrotécnica e de Computadores
Porto – Portugal
Abstract. The purpose of the research herein presented is the automatic detection of the rib borders in posterior-anterior (PA) digital chest radiographs. In a computer-aided diagnosis system, the precise location of the ribs is important as it allows reducing the false positives in the detection of abnormalities such as nodules, rib lesions and lung lesions. We adopted an edge-based approach aiming at detecting the lower border of each rib. For this purpose, the rib geometric model is described as a parabola. For each rib, the upper limit is obtained using the position of the corresponding lower border.
1 Introduction
X-ray images are a valuable means for the diagnosis of the pulmonary condition, as they can provide a general indication of the patient's pathological status. From these images it is also possible to obtain specific information related to some diseases, such as pulmonary nodules. However, some relevant diagnostic features are often superimposed by other anatomical structures, such as the ribs or the heart [1]. The precise location of these superimposing structures is important to adapt the procedures to the specific changes produced by these structures. In this paper, we focus our attention on the automatic detection of the rib borders in posterior-anterior chest radiographs, in order to segment these anatomical structures and to determine their precise location. This problem has been addressed by other authors using distinct approaches. Vogelsang et al. [1] defined a parabola as a suitable model for the rib borders. To describe each rib completely, they used four parabolas, two for the left and right borders of the ventral edges and two for the lower and upper borders of the dorsal ribs. This method achieved a suitable detection, but in some cases manual corrections of the detected rib borders were necessary. In [2], the authors extracted the rib edges using a Canny edge detector and a connectivity method, called "4 way with 10 neighbors". The reported rate for rib labeling was about 60%. These results were obtained with a database containing some images with an unclear presentation of the ribs. A statistical shape model for the complete rib cage was
presented by Ginneken in [3]. Instead of detecting rib border candidates locally and applying rules to infer the rib cage from these candidates, the global rib cage was fitted. Each posterior rib was modeled by two parallel parabolas, one for the lower rib border and the other for the upper one. In the work described in this paper, an edge-based technique is used to detect the lower borders of the ribs. A set of candidate points is obtained and a set of rules is applied aiming at selecting the correct rib border points. The geometrical model adopted to describe each rib border is a parabola. The upper rib limit is an equidistant version of the parabola associated with the lower one. The distance between the two curves is measured in the direction of the perpendicular to the tangent of the lower parabola. Section 2 reports the methodology used for detecting the lower borders of the ribs. The detection of the upper borders is described in Sect. 3. Some results and conclusions are presented in Sects. 4 and 5, respectively.
2 Detection of the Lower Borders of the Ribs
A set of 30 X-ray chest images is the database used in this work, containing images with well contrasted rib borders, but also including images with less contrasted rib borders. Before applying the rib border detection procedure, the images are processed to automatically locate the lung field areas. The two main results of this procedure are used in different phases of the rib segmentation process. Further details concerning these methods were described elsewhere [4], [5]. This initial step is required in order to make the rib segmentation task easier and to reduce the processing time. The results of the lung field region of interest (ROI) detection and boundary delineation are illustrated in Fig. 1a) and 1b), respectively. The rib border detection task is accomplished in three main phases. In the first phase, each lung ROI is filtered using two differential filters to obtain edge points that are expected to belong to the lower borders of the ribs. The two images that result from the filtering step are processed separately, aiming at getting candidate border segments that will be joined together, or eliminated, in the second phase. In the third phase, the upper borders of the ribs are determined. To obtain useful edge points, two different directional filters were selected. The choice of the filter kernels was directed to the adaptation of the filter maximal response to the expected rib slope. The kernels of these filters are represented in Fig. 2. For each filter, the result is a gray-level image which is binarized to allow the skeleton determination. Binarization is performed using the well-known Otsu method [6]. This method was chosen because it is a fully automatic procedure and the results were, in general, good. The subsequent thinning process is required to get one-pixel-wide segments that are expected to belong to the lower borders of the ribs. Some results of this processing sequence are shown in Fig. 3b), 3c) and 3d), for the original image presented in 3a).
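The filtering, binarization and thinning sequence described above can be sketched as follows; the directional kernels of Fig. 2 are not reproduced in the text, so the kernel below is a placeholder (a sketch, not the authors' implementation):

import numpy as np
from scipy.ndimage import convolve
from skimage.filters import threshold_otsu
from skimage.morphology import skeletonize

def lower_border_candidates(roi, kernel):
    """roi: gray-level lung ROI; kernel: one of the two directional kernels
    tuned to the expected rib slope (placeholder values below)."""
    response = convolve(roi.astype(float), kernel)
    binary = response > threshold_otsu(response)   # automatic Otsu threshold
    return skeletonize(binary)                      # one-pixel-wide candidate segments

# Hypothetical directional kernel responding to near-horizontal dark-to-bright edges
kernel_example = np.array([[-1, -1, -1],
                           [ 0,  0,  0],
                           [ 1,  1,  1]], dtype=float)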
Fig. 1. a) Original image with ROI limits overlapped; b) Original image with automatically detected lung boundaries overlapped.
Fig. 2. a) Kernel 1; b) Kernel 2.
From the analysis of Fig. 3d), it can be observed that some segments must be removed, as they correspond to the clavicle or to anterior ribs. Because their slope is known to vary approximately in the range of 60 to 160 degrees, the Radon transform was used to detect and remove these segments. The result of this cleaning step is illustrated in Fig. 3e). The processing sequence continues with a region labeling operation that assigns a distinct label to each individual connected set of edge points. To obtain a description of a particular rib border, the thinned segments that result from the previous steps need to be linked together or eliminated. For this purpose, each connected set of points is approximated by a second-degree polynomial function. Then every pair of connected segments is compared in order to decide if the two segments belong to the same rib border. To accomplish this linking phase, a measure of the number of coincident points between two specific segments was specified. This measure can be understood as the number of consecutive points in the two compared segments such that the vertical distance between the two approximation polynomials is less than an established threshold. In order to decide which connected sets must be joined together to define a particular rib, this measure is combined with the distance between the end points of the two curves that are being compared. Figure 4a) shows the curves that are obtained after approximating the connected edge segments by second-degree polynomials. The union of these curves using the referred criteria is presented in Fig. 4b). As can be observed in this figure, because these criteria were not able to ensure all the expected connections, a linear approximation of the curves is further used to improve the linking phase.
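A sketch of the linking criteria described above, assuming each segment is stored as an array of (row, column) coordinates; for simplicity the coincident-points measure counts all columns rather than enforcing consecutiveness, and the thresholds are illustrative:

import numpy as np

def coincident_points(seg_a, seg_b, max_vertical_dist=3.0):
    """Count columns over which the two second-degree fits stay within
    max_vertical_dist pixels of each other (simplified version of the measure)."""
    fit_a = np.polyfit(seg_a[:, 1], seg_a[:, 0], deg=2)   # row = f(col)
    fit_b = np.polyfit(seg_b[:, 1], seg_b[:, 0], deg=2)
    cols = np.arange(min(seg_a[:, 1].min(), seg_b[:, 1].min()),
                     max(seg_a[:, 1].max(), seg_b[:, 1].max()) + 1)
    diff = np.abs(np.polyval(fit_a, cols) - np.polyval(fit_b, cols))
    return int(np.sum(diff < max_vertical_dist))

def should_link(seg_a, seg_b, min_coincident=20, max_endpoint_gap=40.0):
    """Combine the coincident-points measure with the distance between the
    end points of the two curves, as in the text (illustrative thresholds)."""
    gap = np.min([np.linalg.norm(pa - pb)
                  for pa in (seg_a[0], seg_a[-1])
                  for pb in (seg_b[0], seg_b[-1])])
    return coincident_points(seg_a, seg_b) >= min_coincident and gap <= max_endpoint_gap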
Fig. 3. a) Original right lung ROI; b) Image filtered with kernel 2; c) Binary image; d) Thinned image; e) Cleaned image.
In this step, if the crossing point of the two linear approximations is contained in the rectangular area defined by the terminal points of the curves, as illustrated in Fig. 4c), the original sets of edges are also joined together. At this point, a final attempt to join segments is made, this time using a smaller limit for the number of coincident points. Although this is a less restrictive criterion, it can only be applied when the two compared curves have a small distance between their end points. The final result of this operation is presented in Fig. 4e). As mentioned before, the previously described processing sequence is applied independently to the binary images that result from the two differential filters. In order to get a final result, the segments from each image are combined using principles similar to those that were adopted for each individual image. Each curve from one image is compared with a curve from the other image using the number-of-coincident-points criterion. However, for each particular curve, the position of its vertex is also calculated and the curves are only linked together if the difference between their vertex row indices is smaller than a specified value. The linear approximation step is also applied to improve the final results. Figures 5a) and 5b) show the rib segments obtained with the two filters, and their combination to produce the final result is presented in Fig. 5c). The application of the methods described before to the image test set revealed two problems, namely missing ribs and the occurrence of rib curves that were not completely defined. The first problem is overcome by measuring the
Fig. 4. a) Approximation of the sets of connected points by second-degree polynomials; b) Union of connected segments; c) Linear approximation of the connected segments; d) Union of the connected segments using the linear approximation; e) Union of the connected segments with a low number of coincident points.
distance between consecutive ribs; if the distance is twice the expected value, a new curve is included. To improve the detection accuracy, this new curve is adjusted using the positions of the edge points where the differential filters produced the highest values. To solve the second problem, the detected segments are extended using a recursive algorithm. A few points are extrapolated from the curve and it is adjusted again. The extrapolation/adjustment processes are repeated until the whole rib segment is reconstructed. Figures 6a), 6b) and 6c) illustrate the two mentioned problems; in Fig. 6c) there is a missing detection case and most curves should be extended. In Fig. 6d), a new curve was included between the fifth and the last rib and all the other curves were extended to their maximal extension, as limited by the lung border detection algorithm.
3 Detection of the Upper Borders of the Ribs
The upper border for each detected rib is obtained using the lower border information. In this work, the lower and the upper borders are equidistant parabolas.
Fig. 5. a) Curves obtained with the first filter; b) Curves obtained from the second filter; c) Detected lower borders of the ribs.
Fig. 6. a) Curves obtained with the first filter; b) Curves obtained from the second filter; c) Combination of results; d) Final result after curve inclusion and curve extension.
The distance is measured on the perpendicular to the tangent to the lower parabola. In Fig. 7 the arrows show the difference between distances measured along the perpendicular and vertical directions. As the filters were designed to have a maximal response on the lower borders of the ribs, minimal responses are achieved on the upper limit of the ribs. As a result, the upper borders are adjusted using the positions of the local minimal values that result from the initial filtering phase.
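Offsetting the lower-border parabola along the perpendicular to its tangent can be sketched as follows; the rib thickness d is an input that would in practice be refined with the filter minima mentioned above (an illustrative sketch, not the authors' code):

import numpy as np

def offset_parabola(coeffs, cols, d):
    """coeffs = (a, b, c) of the lower border row = a*col**2 + b*col + c.
    Returns points of the upper border obtained by moving each lower-border
    point a distance d along the normal (perpendicular to the tangent)."""
    a, b, c = coeffs
    rows = a * cols**2 + b * cols + c
    slope = 2 * a * cols + b               # d(row)/d(col) along the parabola
    norm = np.sqrt(1.0 + slope**2)
    # the upper border lies above the lower one, i.e. towards smaller row indices
    upper_cols = cols + d * slope / norm
    upper_rows = rows - d / norm
    return upper_cols, upper_rows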
4 Results
The proposed method was tested on a set of 30 images and, in general, the obtained results can be considered good. Figure 8 illustrates some useful results obtained with this method. Some images produced less satisfactory results because, frequently, some rib borders are not totally segmented or some detected curves do not correspond to a real rib.
Fig. 7. Difference between distances measured on the vertical and on the perpendicular directions.
Other problems, such as the missing detection of the lowest ribs, were also observed. In Fig. 9a) an example of this situation is presented. This result is a consequence of the weak response of the filters in that area. Figure 9b) illustrates an incorrect detection due to the presence of a superimposed white shadow that produces a disturbing response from the filters and does not allow a correct extension of the curve. The lowest rib is not detected because of the low image contrast in that region.
Fig. 8. Some good results obtained with the proposed method.
5 Conclusions
In this paper, a novel approach to automate the process of rib border detection in digital chest radiographs is presented. An edge-based technique was used to detect the lower limit of each rib. The upper borders of the ribs are obtained using the lower border information. The identified limitations of the proposed method are the occurrence of missing ribs and the incorrect positioning of the curves, mainly in the extreme regions of
Fig. 9. Some bad results obtained with the proposed method.
the lung fields. These problems are frequently a consequence of the low contrast between the ribs and the soft tissue. In some cases, the lowest rib is not detected. However, this situation can easily be overcome by measuring the distance from the lowest detected rib border to the bottom of the lung. The measured distance can be compared with the mean distance between the detected rib borders, to estimate the number of missing ribs, as well as their locations. These results are expected to improve some automatic procedures for the analysis of the pulmonary condition related to some diseases, namely the detection of lung nodules.
References
1. Vogelsang, F., Weiler, F., Dahmen, J., Kilbinger, M., Wein, B., Günther, R.W., Detection and compensation of rib structures in chest radiographs for diagnose assistance, Proceedings of SPIE, 3338:774-785, 1998
2. Park, M., Jin, J.S., Wilson, L.S., Detection and labeling ribs on expiration chest radiographs, The Physics of Medical Imaging Conference, SPIE, pp. 1021-1031, 15-20 Feb 2003
3. Ginneken, B., Romeny, Bart M. H., Automatic delineation of ribs in frontal chest radiographs, in Medical Imaging 2000: Image Processing, Proceedings of SPIE, 3979:825-836, 2000
4. Monteiro, M., Mendonça, A.M., Campilho, A., Lung field detection on chest radiographs, Proceedings of VIIP02, IASTED International Conference on Visualization, Imaging and Image Processing, 367-372, Spain, 2002
5. Mendonça, A.M., Silva, J., Campilho, A., Automatic delimitation of lung fields on chest radiographs, Proceedings of ISBI04, IEEE International Symposium on Biomedical Imaging, Arlington, USA, 2004
6. Otsu, N., A threshold selection method from gray-level histograms, IEEE Transactions on Systems, Man, and Cybernetics, vol. 9, no. 1, pp. 62-66, 1979
Isosurface-Based Level Set Framework for MRA Segmentation

Yongqiang Zhao and Minglu Li

Department of Computer Science and Engineering, Shanghai Jiaotong University, Shanghai, China
Abstract. Segmentation is one of the most important and difficult procedures in medical image analysis and its clinical applications, and blood vessels are especially difficult to segment. In this paper, we propose an isosurface-based level set framework to extract the vasculature tree from magnetic resonance angiography (MRA) volumes. First, we process the extracted isosurface of the MRA via the surface normal vectors; then we use Canny edge detection to compute an image-based speed function for a level set evolution that refines the processed isosurface into the final segmentation. Results on test cases demonstrate the accuracy and efficiency of the approach.
1 Introduction
Vascular diseases are one of the major causes of death in the world. Each year approximately 500 000 people suffer a new or recurrent stroke in the U.S., and approximately 150 000 die as a result. A report raises the importance of research needed in the area of angiography, the branch of medicine which deals with veins and arteries [1]. From the clinical point of view, digital subtraction angiography (DSA) is considered the most reliable and accurate method for vascular imaging. However, this method lacks 3-D information, which is easily available via MR and CT techniques. The MR and CT techniques, in turn, lack the ability to locate tiny vessels and to support the morphological estimation of stenoses and aneurysms [1]. However, these vessels, and their branches which exhibit much variability, are most important in planning and performing neurosurgical procedures [2]. An accurate model of the vascular system from an MRA data volume is necessary to detect these diseases at early stages and hence may prevent invasive treatments. A variety of methods have been developed for segmenting vessels within MRA [3][4]. The vasculature segmentation techniques can be divided into two types: skeleton-based and nonskeleton-based. Skeleton-based techniques are those which segment and reconstruct the vessels by first computing the skeleton of the vessels from the two-dimensional (2-D) slices. These are also called indirect methods, since the vessels are reconstructed by computing the vessel cross sections. Nonskeleton-based techniques are those that compute the vessels in three dimensions (3-D) directly. Here the vessel reconstruction is
done without estimating the vessel cross sections. Given the skeleton center line of the vessels, the difference between these classes is in the way the cross section of the vessels is estimated. Within the nonskeleton-based methods, the most common method is maximum intensity projection (MIP); the MIP image is generated by selecting the maximum value along the optical ray that corresponds to each pixel of the 2-D MIP image. It is an easy and fast way to visualize angiography data, and MIP images can be obtained for any viewing direction (Fig. 1). The major disadvantages of MIP are that it loses the 3-D information and that it is not helpful for finding stenoses [3]. Krissian et al. [5] developed a 3-D segmentation method for brain vessel segmentation using MRA. The algorithm can be classified as a model-based technique using the scale-space framework. This multiscale vessel model took the initial model as a cylinder, which was too simple for the complex nature of the vascular network. A different multiscale approach based on medial axes uses the assumption that the centerlines of the vessels often appear brightest, in order to detect these centerlines as intensity ridges of the image [6]. The width of a vessel is then determined by a multiscale response function. Udupa et al. developed an algorithm for segmentation of vessels in 3-D based on scale-space fuzzy connectedness [8]. The technique was utilized for separating arteries from veins [9], but it was only used for CE-MRA data sets; moreover, user interaction was needed for parameter estimation and threshold selection [3]. Lorigo et al. presented an algorithm for brain vessel reconstruction based on curve evolution in 3-D, also known as "co-dimension two" geodesic active contours [7]. The level set method for capturing moving fronts was introduced by Osher and Sethian [10]. Over the years, it has proven to be a robust numerical device for this purpose in a diverse collection of problems. The advantages of the level set representation are that it is intrinsic (independent of parameterization) and that it is topologically flexible. Hossam [11] used a level set based segmentation algorithm to extract the vascular tree from phase-contrast magnetic resonance angiography. The algorithm initializes level sets in each image slice using automatic seed initialization and then, iteratively, each level set approaches the steady state and contains the vessel or non-vessel area. The approach is fast and accurate [11]. In this paper, we propose an isosurface-based level set framework to extract the vascular tree from MRA volumes. First, we process the extracted isosurface of the MRA via the surface normal vectors; then we use Canny edge detection to compute an image-based speed function for a level set that refines the processed isosurface into the exact result. The rest of this paper is organized as follows. Section 2 introduces the level set method and its application in medical image science. Then, we illustrate our framework in Section 3 and the results in Section 4. In Section 5, our conclusions are stated with some pointers for future work.
2 Level Set Method
The basic idea of the level set method is to start with a closed curve in two dimensions (or a surface in three dimensions) and allow the curve to move perpendicular to itself at a prescribed speed. Consider a closed moving interface $\Gamma(t)$ in $\mathbb{R}^n$ with co-dimension 1, and let $\Omega(t)$ be the (possibly multiply connected) region that $\Gamma(t)$ encloses. We associate with $\Omega(t)$ an auxiliary scalar function $\phi(\mathbf{x},t)$, which is known as the level set function. Over the image area, it is a surface which is positive inside the region, negative outside and zero on the interfaces between regions. It can be described in mathematical form:

$\phi(\mathbf{x},t) > 0$ for $\mathbf{x} \in \Omega(t)$, $\phi(\mathbf{x},t) < 0$ for $\mathbf{x} \notin \Omega(t)$, and $\phi(\mathbf{x},t) = 0$ for $\mathbf{x} \in \Gamma(t)$.
The evolution equation for the level set function $\phi$ takes the following form:

$\frac{\partial\phi}{\partial t} + F\,|\nabla\phi| = 0. \qquad (4)$
The evolution of the level set function is determined by the speed function F. A number of numerical techniques make the initial value problem of Eq. (4) computationally feasible. Two of the most important are the "up-wind" scheme, which addresses the problem of overshooting when trying to integrate Eq. (4) in time by finite differences, and the "narrow band" scheme, which solves Eq. (4) only in a narrow band of voxels near the surface. Typically, in segmentation applications, the speed function is made up of several terms, and Eq. (4) is modified as follows:

$\frac{\partial\phi}{\partial t} + \left(F_{prop} + F_{curv}\right)|\nabla\phi| + \vec{F}_{adv}\cdot\nabla\phi = 0, \qquad (5)$
where $F_{prop}$ and $F_{curv}$ are speed terms that can be spatially varying: $F_{prop}$ is an expansion or contraction speed, and $F_{curv}$ is a part of the speed that depends on the intrinsic geometry, especially the curvature of the contour and/or its derivatives. $\vec{F}_{adv}$ is an underlying velocity field that passively transports the contour [12]. Level set segmentation relies on a surface-fitting strategy, which is effective for dealing with both small-scale noise and smoother intensity fluctuations in volume data. The level set segmentation method creates a new volume from the input data by solving the initial value PDE (5) with user-defined feature-extracting terms. Given the local/global nature of these terms, proper initialization of the level set algorithm is extremely important. Thus, level set deformations alone are not sufficient; they must be combined with powerful initialization techniques in order to produce successful segmentations [13]. From another point of view, to get a good segmentation result, the speed term F also plays a key role. F depends on many factors, including the local properties of the curve, such as the curvature, and the global properties, such as the shape and the position of the front.
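A deliberately simplified 2-D sketch of one explicit update step of Eq. (4) with a propagation and a curvature term, without the up-wind and narrow-band machinery discussed above (an illustration of the role of the speed terms, not the framework's implementation):

import numpy as np

def level_set_step(phi, F_prop, dt=0.1, curvature_weight=0.2):
    """One explicit Euler step of  d(phi)/dt = (F_prop + eps * kappa) * |grad phi|.
    phi: level set function on a 2-D grid; F_prop: spatially varying speed."""
    gy, gx = np.gradient(phi)
    grad_mag = np.sqrt(gx**2 + gy**2) + 1e-12
    # mean curvature of the level sets: divergence of the normalized gradient
    ny, nx = gy / grad_mag, gx / grad_mag
    kappa = np.gradient(nx, axis=1) + np.gradient(ny, axis=0)
    return phi + dt * (F_prop + curvature_weight * kappa) * grad_mag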
3 Segmentation Framework
The proposed segmentation framework is composed of two stages: initial segmentation, and level set refinement including connectivity filtering. The flowchart of the framework is shown in Fig. 1.
Fig. 1. Isosurface-based Level Set Framework
3.1 Initial Segmentation
Isosurfaces of blood vessels, the colon, and other anatomical structures can have a highly realistic appearance. Among MR blood imaging techniques, time-of-flight (TOF) MRA utilizes the in-flow effect. Using a very short repetition time during data acquisition, the static surrounding tissue becomes saturated, resulting in low signal intensity in the acquired images. In contrast, the replenished flowing spins are less saturated, providing a stronger signal, which allows the vessels to be differentiated from the surrounding tissues. In contrast-enhanced MRA especially, high contrast can be obtained in shorter examination times by the injection of a contrast agent. From these points, we can make the following assumption: most of the main vasculature can be thought of as lying on the same isosurface. An isosurface is defined as a surface in which all the vertices are located at points in the image that share the same intensity:
$S = \{\, v : I(v) = T \,\},$

where $I(v)$ is the linearly interpolated image intensity at the location of vertex $v$, $T$ is the iso-intensity value, and $S$ is the set of vertices in the isosurface. The isosurface can be constructed by the Marching Cubes algorithm [15].
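In practice, such an isosurface can be extracted with an off-the-shelf Marching Cubes implementation, for instance the one in scikit-image; the iso-intensity value would be chosen from the MRA intensity histogram, and the value and synthetic volume below are placeholders:

import numpy as np
from skimage import measure

def extract_isosurface(volume, iso_value):
    """Marching Cubes extraction of the surface {v : I(v) = T} from a volume."""
    verts, faces, normals, values = measure.marching_cubes(volume, level=iso_value)
    return verts, faces

# Placeholder usage on a synthetic volume (a bright blob standing in for a vessel)
zz, yy, xx = np.mgrid[0:64, 0:64, 0:64]
volume = np.exp(-((xx - 32)**2 + (yy - 32)**2 + (zz - 32)**2) / 200.0)
verts, faces = extract_isosurface(volume, iso_value=0.5)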
After extracting the isosurface, we adopt the approach presented by Tasdizen et al. [14] to smooth and evolve this isosurface. It uses a level set to represent the isosurface, and it splits the surface deformation into a two-step process that (i) solves anisotropic diffusion on the normal map of the surface, and (ii) deforms the surface so that it fits the smoothed normals. This approach is based on the proposition that the natural generalization of image processing to surfaces is via the surface normal vectors. The variation of the normals has a more intuitive meaning than the variation of points on the surface. A smooth surface is one that has smoothly varying normals, and creases on a piecewise smooth surface appear as discontinuities in the normals [14]. Here we use anisotropic diffusion because it can overcome the major drawbacks of conventional filtering methods and is able to remove noise while preserving edge information.
3.2 Level Set Refinement
However, isosurfaces have limited value for shape analysis, since the image intensities of an object and of the background are usually inhomogeneous. Thus, a single iso-intensity value usually cannot provide an accurate segmentation of an entire anatomical structure. This effect is pronounced in MRA [15]. To solve this problem, in the next stage we use level set segmentation to refine the initial segmentation; that is, the result of the first stage is taken as the initial model of the level set segmentation. This stage constructs a speed function which is designed to lock onto edges as detected by a Canny edge detector. As the initial model is already close to the boundaries of the vasculature tree, the expansion speed is removed. The method works by first constructing the advection field speed based on edge features in the image. The level set front is then moved according to this term, with the addition of the curvature term to control the smoothness of the solution. The speed term is constructed as the Danielsson distance transform of the Canny edge volume, as calculated by the Canny edge detector in Fig. 1. Here the advection field term is constructed by minimizing the squared Danielsson distance.
This term moves the level set down the gradient of the distance transform. The evolution equation for the above level set function takes the following form:

$\frac{\partial\phi}{\partial t} + A(\mathbf{x})\cdot\nabla\phi = Z(\mathbf{x})\,\kappa\,|\nabla\phi|,$

where $A(\mathbf{x})$ is the advection term and $Z(\mathbf{x})$ is a spatial modifier term for the curvature $\kappa$.
Fig. 2. From left to right: brain vessels TOF-MRA volume, leg vessels CE-MRA volume; from top to bottom: the first row is the MIP representation of the MRA volumes, the second and third rows are the initial and final segmentation results, respectively.
4 Results
We have run the proposed framework on various images. In Fig. 2, the left column shows a brain TOF-MRA volume and the right column a leg CE-MRA volume. The first row shows the MIP representations of the two volumes; the second and third rows show the initial and final segmentation results obtained with the proposed framework, respectively, visualized with the marching cubes algorithm. Comparing the segmentation results with the 2D MIP images, they show almost the same vasculature tree as the MIP; meanwhile, they provide spatial information for the doctor. Comparing the initial and final results, we can see that the final results are less noisy and more exact after the refinement by the level set method. However, isosurfaces have limited value for shape analysis, since the image intensities of an object and of the background are usually inhomogeneous. In MRA, other soft tissues may share the same range of grey levels as the vessels. In such a condition, our proposed method cannot provide an accurate segmentation of an entire anatomical structure. On the right side of Fig. 2, although the final result provides more 3D information on the vascular tree than the 2D MIP, other tissues or organs that are not of interest can obscure the visualization of the structures of interest. As a result, some vessels are hard to distinguish.
5 Conclusion
In this paper, we presented a two-stage isosurface-based level set framework for MRA segmentation. The first stage considers a global property of the data to obtain the initial segmentation; in the second stage, we use local properties, including curvature and edges, to design the speed function. This level set framework has practical appeal for vascular segmentation from MRA. As we have said, blood vessels are especially difficult to segment, and we are still far from achieving robust segmentation in real time. Future work will add more image properties, such as fuzzy connectedness and differential geometry features, to the initialization and the speed function in order to enhance the segmentation results and make the framework more robust.
Acknowledgement. We would like to thank Ms. Zhang Lei of Tongji Hospital affiliated to Tongji University for providing the MRA images. This work is supported by the Dawning Program of Shanghai, China (grant #02SG15).
References
1. Jasjit S. Suri, Kecheng Liu, Laura Reden, and Swamy L.: A Review on MR Vascular Image Processing Algorithms: Acquisition and Prefiltering: Part I. IEEE Trans. Information Technology in Biomedicine, vol. 6, no. 4, pp. 324-337, Dec. 2002.
2. Liana M. Lorigo, Olivier D. Faugeras, W. Eric L. Grimson, Renaud K.: CURVES: Curve Evolution for Vessel Segmentation, Medical Image Analysis, 5:195-206, 2001.
3. Jasjit S. Suri, Kecheng Liu, Laura Reden, and Swamy L.: A Review on MR Vascular Image Processing: Skeleton Versus Nonskeleton Approaches: Part II. IEEE Trans. Information Technology in Biomedicine, vol. 6, no. 4, pp. 338-350, Dec. 2002.
4. Kirbas, C., and Quek, F.: Vessel Extraction Techniques and Algorithms: A Survey. IEEE Conference on Bio-Informatics and Bio-Engineering (BIBE), 2003, pp. 238-245.
5. K. Krissian, G. Malandain, N. Ayache, R. Vaillant, and Y. Trousset: Model Based Detection of Tubular Structures in 3D Images, Computer Vision and Image Understanding, vol. 80, no. 2, pp. 130-171, Nov. 2000.
6. S. R. Aylward, E. Bullitt, S. M. Pizer, and D. Eberly: Intensity ridge and widths for tubular object segmentation and registration, Proc. IEEE Workshop Mathematical Methods in Biomedical Image Analysis, 1996, pp. 131-138.
7. L. M. Lorigo, O. Faugeras, W. E. L. Grimson, R. Keriven, R. Kikinis, and C.-F. Westin: Co-dimension 2 geodesic active contours for MRA segmentation, Proc. 16th Int. Conf. Inform. Processing Med. Imaging, vol. 1613, Visegrad, Hungary, June/July 1999, pp. 126-139.
8. P. K. Saha, J. K. Udupa, and D. Odhner: Scale-based fuzzy connected image segmentation: Theory, algorithm, and validation, Comput. Vision Image Understanding, vol. 77, no. 2, pp. 145-174, 2000.
9. T. Lei, J. K. Udupa, P. K. Saha, and D. Odhner: MR angiographic visualization and artery-vein separation, Proc. SPIE, vol. 3658, pp. 58-66, 1999.
10. S. Osher and J. A. Sethian: Fronts propagating with curvature dependent speed: Algorithms based on Hamilton-Jacobi formulations, Journal of Computational Physics 79, pp. 12-49, 1988.
11. Hossam El Din, Abd El Munim: Cerebrovascular segmentation for MRA data using level sets, International Congress Series, vol. 1256, June 2003, pp. 246-252.
12. Xiao Han, Chenyang Xu, and Jerry L. Prince: A Topology Preserving Deformable Model Using Level Sets, Proc. IEEE Conf. CVPR 2001, vol. II, pp. 765-770, Kauai, HI, Dec. 2001.
13. Ross Whitaker, David Breen, Ken Museth, and Neha Soni: A Framework for Level Set Segmentation of Volume Datasets, Proceedings of the ACM International Workshop on Volume Graphics, pp. 159-168, June 2001.
14. Tolga Tasdizen, Ross Whitaker, Paul Burchard and Stanley Osher: Geometric surface smoothing via anisotropic diffusion of normals. Proceedings of the conference on Visualization 2002, Boston, Massachusetts, pp. 125-132.
15. Peter J. Yim, G. Boudewijn C. Vasbinder, Vincent B. Ho, and Peter L. Choyke: Isosurfaces as Deformable Models for Magnetic Resonance Angiography. IEEE Transactions on Medical Imaging, vol. 22, no. 7, pp. 875-881, July 2003.
Segmentation of the Comet Assay Images

Bogdan Smolka 1* and Rastislav Lukac 2

1 Silesian University of Technology, Department of Automatic Control, Akademicka 16 Str, 44-100 Gliwice, Poland
2 The Edward S. Rogers Sr. Department of Electrical and Computer Engineering, University of Toronto, 10 King's College Road, Toronto ON, M5S 3G4, Canada
Abstract. The single cell gel electrophoresis, called the comet assay, is a microelectrophoretic technique for the direct visualization of DNA damage at the cell level. In the comet assay, the cells suspended in an agarose gel on a microscope slide are subjected to lysis, unwinding of the DNA and electrophoresis. Under the influence of a weak, static electric field, charged DNA migrates away from the nucleus, forming the so-called comet. The damage is quantified by measuring the amount of the genetic material which migrates from the nucleus to form the comet tail. In this paper we present three novel methods for the extraction of the comet tail and head, which allow the quantification of the cell damage.
1 Introduction

The comet assay (single cell gel electrophoresis, SCGE, or microgel electrophoresis, MGE) is a useful method for quantifying cellular DNA damage caused by different genotoxic agents. The idea of single cell electrophoresis as a method of measuring DNA damage was introduced by Rydberg and Johanson [7], and the comet assay was introduced by Östling and Johanson [5]. In the original form of the assay, cells embedded in agarose on microscope slides were lysed and subjected to electrophoresis under neutral conditions, enabling the detection of DNA double strand breaks. A later modification introduced by Singh [8] made it possible to detect DNA single strand breaks and alkali labile sites using alkaline conditions. The assay was named for the characteristic shape of the DNA flowing from the nucleus and migrating under the influence of the applied static electric field (see Fig. 1). The measurement of the DNA in the comet's tail enables the quantification of the intensity of DNA damage caused by various genotoxic agents. In recent years the use of the comet assay has grown considerably, as this method detects damage with high sensitivity and is relatively fast and reliable. As a result, this method of detection of DNA strand breaks at the individual cell level is now in wide use in genetic toxicology and oncology. One of the applications of the assay is the analysis of the effects of ionizing radiation [8, 2, 4] on the DNA structure. The formation of a comet may be a result of DNA single strand breaks (SSB), double strand breaks (DSB)
* This research has been supported by the KBN grant 4T11F01824
and alkali labile sites. Using different assay pH conditions allows the study of either SSB or DSB. This ability to analyze these two kinds of DNA damage is another important advantage of the comet assay. Irradiation is known to produce SSB as well as DSB in DNA, and it has been hypothesized that strand break repair capacity (especially for DSB) may be the basis for clinically observed radiation sensitivity [4].
Fig. 1. Typical comet assay image (a) and its model (b), below the intensity-sliced image (c) and manually determined boundary of the comet’s head, halo and tail, (d).
In our investigations the alkaline comet assay is applied for the evaluation of the individual radiosensitivity of patients with cervical carcinoma and head and neck cancer, in order to validate the usefulness of this method as a predictive test for normal tissue response. For the experiments, peripheral blood lymphocytes were taken from patients before the beginning of radiotherapy. The comet assay was performed according to Singh [8], with some modifications described in [13]. The comets were observed using a fluorescence microscope at 400-fold magnification and the images were acquired using a 512x512, 256 gray-level frame grabber and stored on a computer hard disk. From the laboratory database, a collection of 30 digital comet images was prepared and all experiments were performed on these pictures.
2 New Methods of Comet's Tail Extraction

2.1 Probabilistic, Iterative Approach
The algorithm introduced here is based on a model of a virtual particle which performs a random walk on the image lattice. It is assumed that the probability of a transition of the jumping particle from a lattice point to a point belonging to its neighborhood is determined by a Gibbs distribution, defined on the image lattice with the eight-neighborhood system [10,11,12]. Using this model, the image is treated as a realization of a Markov random field and it is assumed that the information on the local image properties is contained in the partition function of the local statistical system. Let the image be represented by a matrix I, and let us introduce a virtual particle which can perform a random walk on the image lattice, visiting its neighbors or staying at its temporary position. In this work it is assumed that the particle moves on the image lattice with probabilities of a transition between neighboring points derived from the Gibbs distribution formula
where the symbol $\sim$ denotes the neighborhood relation, $\beta$ plays the role of the inverse temperature of the statistical system, and $Z$ is the partition function (statistical sum) of the local structure,
As we work with noisy images, let us assume that the value of the pixel being processed is set to zero. Then, the value of the pixel at its position is not taken into consideration when calculating the transition probabilities. Under this assumption
The probability that the virtual jumping particle will stay at its current position and will not escape to any of its neighbors is then given by
Assigning to each image point the probability that the randomly walking particle will stay at its current position leads to a map of probabilities, which
Fig. 2. Image segmentation using the probabilistic, iterative approach: a) test image (Fig. 1a); b), c), d) output images after 5, 10 and 20 iterations, respectively.
can be treated as a new image. The successive iterations lead to a binary image consisting of the comet head and its tail (see Figs. 2 and 3a) [10,12]. All binary images depicted in Fig. 3a) were obtained using the same parameter of the Gibbs distribution. Having the information about the comet's tail, different calculations of the distribution of the DNA which escaped from the cell nucleus can be performed.
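A sketch of the probability map computation, under the assumption that the transition energy is the absolute gray-level difference between neighbouring pixels; the exact energy of the Gibbs model is defined in [10,12], so the form below is an illustrative assumption rather than the paper's formula. The paper iterates this map to obtain the binary comet image of Fig. 2.

import numpy as np

def stay_probability_map(image, beta=0.05):
    """Probability, for each pixel, that the virtual particle stays in place.
    The transition weight to a neighbour is assumed here to be
    exp(-beta * |gray-level difference|); zero-valued (suppressed) pixels are
    excluded, as described in the text."""
    img = image.astype(float)
    rows, cols = img.shape
    padded = np.pad(img, 1, mode='edge')
    total = np.ones((rows, cols))              # weight of staying put: exp(0) = 1
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            if dr == 0 and dc == 0:
                continue
            neigh = padded[1 + dr: 1 + dr + rows, 1 + dc: 1 + dc + cols]
            w = np.where(neigh == 0, 0.0, np.exp(-beta * np.abs(neigh - img)))
            total += w
    return 1.0 / total                         # Gibbs probability of not escaping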
2.2 Region Based Segmentation
In [6] an effective image segmentation method, which works without defining the seeds needed to start the segmentation process, was proposed. This method, originally developed for vector-valued color images, can also be applied to the segmentation of gray level images. At the beginning of the algorithm, each pixel has its own label (the image consists of one-pixel regions). In the construction of the algorithm, the 4-neighborhood system was used to increase the computational efficiency of the method. For the region growing process, the centroid linkage strategy is used. This strategy adds a pixel to a region if it is 4-connected to this region and has a color or gray scale value lying in a specified range around the mean value of the already constructed region. After the inclusion of a new pixel, the region's mean color value is updated. For this updating, a recurrent scheme can be applied. In the first step of the algorithm, a simple raster scan of the image pixels is employed: from left to right and from top to bottom. The next pass, in this two-stage method, starts
from the bottom right corner of the image. This pass permits additional merging of the adjacent regions which, after the first pass, possess features satisfying a predefined homogeneity criterion. During this merging process, each region with a number of pixels below a specified threshold is merged into a region with a larger area, if the homogeneity criterion is fulfilled. After the merging, a new mean color (intensity) of the region is calculated and the labels of the pixels belonging to the region are modified. The segmentation results are strongly determined by the design threshold, which defines the homogeneity criterion. The segmented image can be further post-processed by removing small regions that are usually not significant in further stages of image processing; their intensities are different from the intensity of the object and its background. Post-processing needs an additional third pass from the top left corner to the bottom right corner, whose aim is to remove the regions which consist of a number of pixels smaller than a certain threshold area. During this algorithm step, small regions are merged with the neighboring regions which are closest in terms of a color or intensity distance. The described region-based segmentation technique has been applied to the segmentation of gray level comet assay images. As already mentioned, the segmentation technique works also for single channel images; the only difference is that instead of the color distance between pixels in a specific color space, the absolute difference of their gray scale values is used. The results of the segmentation of comet assay images using the described technique are presented in Fig. 3b). As can be seen, this method detects the comet head and tail well. Of course the results are slightly different from those delivered by the previous algorithm; however, they correspond well with the assessment of a human observer.
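A compact sketch of the first, centroid-linkage pass described above (raster scan, 4-connectivity, running mean update); the second merging pass and the small-region removal are omitted, and the homogeneity threshold is illustrative:

import numpy as np

def centroid_linkage_pass(image, threshold=10.0):
    """First raster-scan pass: a pixel joins the region of its upper or left
    4-neighbour if its gray value lies within `threshold` of that region's
    running mean; otherwise it starts a new one-pixel region."""
    rows, cols = image.shape
    labels = np.zeros((rows, cols), dtype=int)
    means, counts = {}, {}
    next_label = 1
    for r in range(rows):
        for c in range(cols):
            value = float(image[r, c])
            assigned = 0
            for nr, nc in ((r - 1, c), (r, c - 1)):      # upper and left neighbours
                if nr >= 0 and nc >= 0:
                    lab = labels[nr, nc]
                    if lab and abs(value - means[lab]) <= threshold:
                        assigned = lab
                        break
            if not assigned:
                assigned = next_label
                means[assigned], counts[assigned] = value, 0
                next_label += 1
            labels[r, c] = assigned
            counts[assigned] += 1                         # recurrent mean update
            means[assigned] += (value - means[assigned]) / counts[assigned]
    return labels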
2.3 Active Contour Segmentation
In the past decades image segmentation has played an increasingly important role in medical imaging. Image segmentation still remains a difficult task, due to the tremendous variability of the shapes of medical objects and the variations of image quality caused by different sources of noise and sampling artifacts. To address these difficulties, deformable contours have been extensively studied and widely used in medical image segmentation. Deformable contours are curves defined within an image domain that can evolve under the influence of internal and external forces. The internal forces, which are defined within the curve itself, are designed to keep it smooth during deformations; they hold the curve together through elasticity forces and keep it from bending too much through the bending forces [3, 14, 1, 9]. The external forces, which are computed from the image data, are defined to move the model toward an object boundary and attract the curve toward the desired object boundaries. The evolution of an active contour can be described as a process of minimization of a functional representing the contour energy, consisting of internal and potential energy terms.
The internal energy specifies the tension or the smoothness of the contour, whereas the potential energy is defined over the image domain and has local minima at the image edges. A deformable contour is a curve $v(s) = (x(s), y(s))$, $s \in [0,1]$, which evolves on the image domain to minimize the energy functional

$$E(v) = \int_0^1 \frac{1}{2}\left(\alpha |v'(s)|^2 + \beta |v''(s)|^2\right) ds + \int_0^1 P(v(s))\, ds .$$

The first term is the internal energy. The first-order derivative discourages stretching and makes the contour behave like an elastic string, while the second-order derivative discourages bending and makes the model behave like a rigid rod. The second term is the potential energy, where the potential function $P$ is derived from the image data and takes smaller values at object boundaries. If $I(x,y)$ denotes the gray level value at $(x,y)$, then

$$P(x,y) = -\left|\nabla\left[G_{\sigma}(x,y) * I(x,y)\right]\right|^2 ,$$

where $G_{\sigma}$ is a two-dimensional Gaussian, $*$ denotes the convolution operation and $\sigma$ is a parameter. The curve that minimizes the total energy must satisfy the Euler-Lagrange equation

$$\alpha v''(s) - \beta v''''(s) - \nabla P(v) = 0 .$$

This equation says that $F_{int} + F_{ext} = 0$, where the internal force is given by $F_{int} = \alpha v'' - \beta v''''$ and the external force by $F_{ext} = -\nabla P$. To find a solution of the energy minimization problem, the deformable contour is made dynamic by treating $v$ as a function of time $t$. Then we have to solve

$$\frac{\partial v}{\partial t} = \alpha v''(s,t) - \beta v''''(s,t) - \nabla P(v) .$$

When the solution stabilizes, the left side of the above equation is 0 and we achieve a solution of the total energy minimization. In practical applications, special external forces (damping force, multiscale potential force, pressure forces, distance potential force, dynamic distance force, interactive forces, etc.) can be added to the energy minimization scheme. The results of the segmentation of comet assay images using the described active contour technique are presented in Fig. 3c).
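A minimal sketch of this evolution scheme, assuming NumPy and SciPy are available, is shown below; the explicit time step, the parameter values and the nearest-pixel sampling of the forces are illustrative choices, not the values used for the comet assay images.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def evolve_snake(image, snake, alpha=0.1, beta=0.05, tau=0.2, sigma=2.0, iters=500):
    """Explicit gradient-descent evolution of a closed snake given as an (N, 2)
    array of (x, y) nodes; external force is -grad P with P = -|grad(G_sigma * I)|^2."""
    smoothed = gaussian_filter(image.astype(float), sigma)
    gy, gx = np.gradient(smoothed)
    P = -(gx ** 2 + gy ** 2)            # potential: low on edges
    Py, Px = np.gradient(P)             # components of grad P
    for _ in range(iters):
        x, y = snake[:, 0], snake[:, 1]
        # internal forces via finite differences on the closed contour
        d2 = np.roll(snake, -1, 0) - 2 * snake + np.roll(snake, 1, 0)
        d4 = (np.roll(snake, -2, 0) - 4 * np.roll(snake, -1, 0) + 6 * snake
              - 4 * np.roll(snake, 1, 0) + np.roll(snake, 2, 0))
        xi = np.clip(x.round().astype(int), 0, image.shape[1] - 1)
        yi = np.clip(y.round().astype(int), 0, image.shape[0] - 1)
        ext = -np.stack([Px[yi, xi], Py[yi, xi]], axis=1)   # F_ext = -grad P
        snake = snake + tau * (alpha * d2 - beta * d4 + ext)
    return snake
```

In practice the time step and the number of iterations have to be tuned to the image resolution and noise level.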
3 Conclusions
The single cell gel electrophoresis is a powerful tool that can indicate lesions in nuclear DNA caused by various genotoxic agents. However, the lack of standardization is a serious obstacle to evaluating and comparing results obtained in different laboratories. In this paper three novel methods for extracting the comet's tail and head were proposed. As can be seen, the presented methods detect the comet head and tail well, despite the strong noise present in the comet assay images. The results obtained using the different algorithms are not identical; however, they all correspond well with the assessment of a human observer. In future work we will examine which of the proposed segmentation methods yields the best results in the practical evaluation of the comet assay.
References
1. Caselles, V., Kimmel, R., Sapiro, G.: Geodesic active contours. Int. J. Comp. Vision, 22, (1997) 61-69
2. Fairbairn, D.W., Olive, P.L., O'Neil, K.L.: The comet assay: a comprehensive review. Mutation Research, 339, (1995) 37-59
3. Kass, M., Witkin, A., Terzopoulos, D.: Snakes: active contour models. Int. J. Comp. Vision, 1, (1987) 321-331
4. Olive, P.L.: DNA damage and repair in individual cells: applications of the comet assay in radiobiology. Int. J. Radiat. Biol., 75, 4, (1999) 395-405
5. Östling, O., Johanson, K.J.: Microelectrophoretic study of radiation-induced DNA damage in individual mammalian cells. Biochemical and Biophysical Research Communications, 123, (1984) 291-298
6. Palus, H., Bereska, D.: Region-based Colour Image Segmentation. Proc. of 5th Workshop on Color Image Processing, Ilmenau, Germany, (1999) 67-74
7. Rydberg, B., Johanson, K.J.: Estimation of DNA strand breaks in single mammalian cells. In: Hanawalt, P.C., Friedberg, E.C., Fox, C.F. (eds.): DNA Repair Mechanisms, Academic Press, New York, (1978) 465-468
8. Singh, N.P., McCoy, M.T., Tice, R.R., Schneider, E.L.: A simple technique for quantification of low levels of DNA damage in individual cells. Experimental Cell Research, 175, (1988) 184-191
9. Singh, A., Goldgof, D., Terzopoulos, D.: Deformable models in medical image analysis. IEEE Computer Society (1998)
10. Smolka, B., Wojciechowski, K.: A new method of texture binarization. Lecture Notes in Computer Science, 1296, (1997) 629-636
11. Smolka, B., Wojciechowski, K.: Contrast enhancement of badly illuminated images. Lecture Notes in Computer Science, 1296, (1997) 271-278
12. Smolka, B., Wojciechowski, K.: Random walk approach to image enhancement. Signal Processing, 81, (2001) 465-482
13. Wojewódzka, M., Kruszewski, M., Iwanienko, T., Collins, A.R., Szumiel, I.: Application of the comet assay for monitoring DNA damage in workers exposed to chronic low-dose irradiation. Mutation Research, 416, (1998) 21-35
14. Xu, C., Prince, J.L.: Snakes, shapes and gradient vector flow. IEEE Trans. Imag. Proc., 7, (1998) 359-369
Automatic Extraction of the Retina AV Index
I.G. Caderno¹, M.G. Penedo¹, C. Mariño¹, M.J. Carreira², F. Gomez-Ulla³, and F. González³
¹ Grupo VARPA, Dpto. de Computación, Universidade da Coruña, Spain. {igcaderno,cipenedo,castormp}@dc.fi.udc.es
² Dpto. Electrónica e Computación, Universidade de Santiago de Compostela, Spain. [email protected]
³ Complejo Hospitalario Universitario de Santiago de Compostela, Spain
Abstract. In this paper we describe a new method to estimate the diameter of veins and arteries in the retinal vascular tree, focusing not only on precision and reliability, but also on suitability for on-line assistance. The developed system analyzes the region of interest selected in the image to estimate the retinal arteriovenous index. This analysis involves two different steps: blood vessel detection, which extracts the vascular structures present in the image, and blood vessel measurement, which estimates the caliber of the already located vessels. The method locates 90% of the structures, with a reliability of 99% in detection and 95% in measurement. Keywords: arteriovenous index, retina angiography, creases extraction algorithms, Canny filter, snakes
1 Introduction
Nowadays, automatic analysis of blood vessels from medical images has become very significant, playing a part in many clinical investigations and scientific studies. Indeed, acquisition faults, low-contrast regions and anatomical background noise make automatic detection a great challenge in this field. In particular, the blood vessels of the retina are the first ones to show symptoms of several pathologies, such as arterial hypertension, arteriosclerosis and other systemic vascular diseases. As a result, the retinal arteriovenous index (AV index) takes on a vital priority for diagnosing these illnesses and evaluating their consequences. That is the reason why a precise and robust estimation of this parameter is usually needed. In spite of this, most studies are made manually and are subjective. This paper deals with the research of a new methodology, and the development of a new semi-automatic system, which allows measuring the AV index on the human retina. The designed technique may be seen from two different points of view: vessel detection and vessel measurement (Figure 1). The detection system is mainly based on a creases extraction algorithm and a vessel tracking paradigm. According to several studies [1,2], the inherently radial vascular tree of the eye makes
it necessary to calculate the retinal AV index on vessels equidistant from the optic nerve. Therefore, a specialist is required to indicate two initial parameters in the image: the optic disk centre and the analysis circumference radius. Hence, the region of interest is just defined as a circular zone around the optic nerve, with a given analysis radius. The proposed methodology tries to identify every blood vessel in this region. For this purpose, the intersection between each crease and the circumference is searched. Once the vein or artery is confirmed by a tracking algorithm, the measurement system starts to work. In this process, the registered creases are used as the seeds of deformable models, responsible for computing the vessel diameters. A snake is employed to incorporate domain-specific knowledge, such as the size or the shape of the structures.
Fig. 1. Proposed methodology. The main steps of both subsystems.
This paper is organised as follows. Section 2 explains the vessel detection system, covering both the computation of the creases and the location of the vessel positions. Section 3 describes the snake model used in this project, pointing out the domain knowledge included to control its deformation. This section also indicates how to measure the vessel from the final configuration. Section 4 shows and comments on some results we have obtained with the system. Finally, section 5 presents some conclusions and future work in this field.
2 Vessel Detection System
Amongst the many mathematical definitions of ridges and valleys [3], a crease may be defined as a continuous zone of points in the image, forming a highest or a lowest level in its environment. In this way, when images are seen as landscapes, blood vessels can be considered as ridges or valleys, that is, regions which form an extreme and tubular level in their neighbourhood. This fact makes it possible to locate the vessels on the analysis radius by using the crease positions. The creases image is obtained using the MLSEC-ST operator [4] (multilevel set extrinsic curvature based on the structure tensor). Given a function $L$, the level set for a constant $l$ consists of the set of points $\{x : L(x) = l\}$. For 2D images
L can be considered as a topographic relief or landscape, where the level sets are its level curves. The level curvature can be defined, according to MLSEC-ST, through the slopelines, that is, the lines integrating the gradient vector field orthogonal to the level curves. For this purpose, the divergence operator is defined as

$$\mathrm{div}(\bar{w}) = \sum_{i=1}^{2} \frac{\partial w_i}{\partial x_i},$$

where $w_i$ is the ith component of $\bar{w}$, the normalized vector field of $\nabla L$ (taken as the zero vector where $\nabla L$ vanishes). In order to obtain a more enhanced and homogeneous creaseness measure, the gradient vector field of the image is prefiltered before applying the divergence operator. This filtering is based on the structure tensor or moment tensor [4]. In this way, several measures such as the confidence degree (CFND), the minimum grey level (THRK), the deviation of the Gaussian smoothing (SI), or the minimum length (BMIN) may be fine-tuned. Adjusting these values is a fundamental step to obtain high-quality results in all kinds of images; in fact, the computation for a well-defined image will differ from that for a low-contrast one. In order to provide an automatic parameter adjustment, a new procedure is included in the system. This procedure should be able to estimate the best value for every parameter according to the particular features of each image. N categories of images must be created to classify the characteristics which most affect each parameter. We have defined N = 4 and a label set based on the contrast (low, normal, high, very high). This labelled contrast is examined in the region of interest: the grey levels of the pixels on the analysis circumference can be represented in a linear way as an intensity profile (Fig. 2). Vessels present darker grey levels in the image, so they appear as valleys in the profile representation. Thus, a higher contrast in the region of interest implies deeper valleys in the intensity profile representation. For this purpose, extreme values (that is, maxima and minima) have to be located in the smoothed profile. Connecting these extreme values we obtain an irregular signal (Fig. 2b). At this point the contrast estimation is easier, because a higher variability among maxima and minima implies a higher contrast in the search region. The variability is computed from the extreme values and their average, essentially measuring the spread of the extrema around their mean. These fluctuations may be discretized, establishing a strong relationship between the contrast and the variability in each category (the empirically calculated values for the parameters are shown in Table 1). Now that we have the creases image, the candidate points for being considered vessels are given by the intersections between each crease and the analysis circumference. In order to discard noise and confirm the vein or artery, a vessel tracking must be applied to the region of interest, using concentric sequences in the nearest neighbourhood of the analysis radius. Having
Fig. 2. Process to calculate the creases from the original image. (a) Region of interest in the angiography. (b) Intensity profile representation (top), after smoothing (middle) and the variability signal (bottom). (c) Crease image ready for detecting vessels.
noticed the intersections between every analysis sequence and every crease, the next step (sequence registration) is to find the correspondences which link the blood vessels between sequences. A voting system is established to decide whether a candidate point belongs to a vessel, according to the angle between the candidate and the optic nerve centre. Only the candidates which receive more than a minimum number of votes are considered vessels.
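As an illustration of the contrast-estimation step described above, the following sketch smooths the circular intensity profile, locates its extrema and derives a variability score; the Gaussian smoothing, the variability measure (standard deviation of the extrema) and the category cut points are assumptions and do not reproduce the values of Table 1.

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

def contrast_category(profile, sigma=3.0, thresholds=(5.0, 15.0, 30.0)):
    """Classify the contrast of a circular intensity profile.

    profile: 1D array of grey levels sampled along the analysis circumference.
    thresholds: assumed cut points separating low / normal / high / very high.
    """
    smoothed = gaussian_filter1d(profile.astype(float), sigma, mode="wrap")
    d = np.diff(smoothed)
    # extrema: sign changes of the first difference
    extrema_idx = np.where(np.sign(d[:-1]) != np.sign(d[1:]))[0] + 1
    extrema = smoothed[extrema_idx]
    variability = extrema.std() if extrema.size else 0.0
    labels = ("low", "normal", "high", "very high")
    category = labels[int(np.searchsorted(thresholds, variability))]
    return variability, category
```

The returned category would then select the corresponding parameter set (CFND, THRK, SI, BMIN) for the creases extraction.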
3 Vessel Measurement System
A snake or active contour [5] may be defined as a polygonal contour which, placed in the image, can evolve to fit the shape of the structure we want to delimit. Each vertex of the snake is moved according to the forces that act on it, and the contour becomes stable when every vertex reaches the minimum of its energy function (in our case, when it reaches the vessel contour). This energy function may be defined as

$$E(v_i) = E_{int}(v_i) + E_{ext}(v_i),$$
where $E_{int}$ represents the internal energy (which corresponds to the snake flexibility and elasticity), and $E_{ext}$ the external energy (which corresponds to the forces that push the snake towards the edges of the shape to locate). 3.1 Model design. The snake model described in this paper has been adapted from the original scheme [5] to use domain knowledge about the problem. This adaptation includes the seed, the shape and the energy terms which drive the growing direction and sense of every vertex. The seed is simply the crease segment between the lowest and the highest sequence in the registration step. The initial contour is obtained from the crease by setting two parallel chains of nodes, one on each side, as shown in Fig. 3. The number of initial nodes of a snake is determined by the number of points in the crease and a constant parameter which controls the sampling stage, that is, the model resolution. The advance direction must be perpendicular to the seed angle, whereas the sense must move the nodes from the seed towards the vessel edge.
Fig. 3. Initial distribution of the snake nodes in the seed.
The internal energy. The internal spline energy can be defined as

$$E_{int}(v_i) = \alpha_i \left| \frac{dv_i}{ds} \right|^2 + \beta_i \left| \frac{d^2 v_i}{ds^2} \right|^2,$$

where the first term involves the first-order derivative and the second term the second-order derivative. The first-order term, controlled by $\alpha_i$, makes the snake behave like a membrane, whereas the second-order term, controlled by $\beta_i$, makes it act like a thin plate [5]. In our model, different values are used according to the kind of node to be moved: $(\alpha_i, \beta_i) = (0, 0)$ for corner nodes and $(\alpha, \beta)$ otherwise,
so corner nodes have no internal energy. This allows first- and second-order discontinuities at the contour corners. Thus, the corner nodes can fit the vessel edges better, as no attraction force will try to keep them together. In our case, $\alpha$ has been set to 0.25 and $\beta$ to 0.01. The external energy. Since the external energy is responsible for the outer forces, this term includes most of the knowledge about the domain. The external energy we propose for the model combines four weighted terms,

$$E_{ext}(v_i) = \gamma_p E_{press}(v_i) + \gamma_d E_{dist}(v_i) + \gamma_g E_{grad}(v_i) + \gamma_s E_{stat}(v_i),$$
where $E_{press}$ represents the dilation pressure, $E_{dist}$ the edge-driven distance, $E_{grad}$ the gradient energy and $E_{stat}$ the stationary energy. The parameters $\gamma_p$, $\gamma_d$, $\gamma_g$ and $\gamma_s$ determine the weights of these terms in the global energy equation. The first term, the dilation pressure, is a vectorial magnitude whose main purpose is to assign an advance direction and sense to each vertex. The second term, the edge-driven distance, indicates to every node the distance to the nearest edge, but only along its advance direction and sense.
The edge image is computed from the original angiography by applying a Canny filter [6]. The third term, the gradient energy, is a stopping force based on grey level differences: it compares the grey level of the original image I at the current position of the node with the grey level at the possible new position of the node along its advance direction. To reduce the effect of noise in this term, the points in the grey level transition are smoothed with a median filter. Finally, the fourth term, the stationary energy, is also a stopping force. This expression measures the degree of adjustment between each node and its neighbours and has been incorporated to cope with intravenous noise and edge discontinuities. The underlying idea is the following: if all the adjacent nodes have reached their energy minimum, there is a high probability that the current node will stop soon; conversely, if all the adjacent nodes keep moving, it is unlikely that the current node has to stop. This term is computed from the external energy in a subset of nodes centred on the vertex under consideration. As a bounded function is required in the crease-edge interval, the exponent of this expression must be positive to obtain a constrained value. The weights of the terms were set according to the evolution of the different terms between the crease and the edge.
3.2 Topological check of the model. Since we know the final configuration the model should achieve, wrong nodes can be detected and corrected using a regression model. To locate these wrong nodes, several indexes can be employed. They involve dist, the shortest distance from the seed to the node; the angle between adjacent nodes; ang(seed), the angle between the seed and the axis; the number of iterations needed to reach the final position; and the number of nodes on each side of the snake. Three constant parameters indicate the highest allowed caliber, the threshold used to compare angles and the threshold used to compare iterations, respectively. As soon as the wrong nodes are marked, their positions can be corrected through a linear regression model, using the coordinates of each correctly positioned vertex as the model variables. In order to guarantee an appropriate correction of the wrong nodes, the whole snake must satisfy a minimum proportion of right nodes and a minimum regression accuracy. Through the first condition, we guarantee an accurate linear slope by demanding a minimum proportion of right nodes on each side of the snake (70%-90%). Through the second condition, the regression accuracy determines whether the model is suitable to describe the relationship between the coordinates. For this it is useful to employ the correlation coefficient r, computed from the covariation of the model and the variations of the coordinates. Values of r near 1 indicate a good model adjustment, whereas values near 0 indicate a bad adjustment; r must lie in [0.80, 0.95].
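A minimal sketch of this topological check is given below; the particular wrong-node criterion, the use of np.polyfit for the regression and the default thresholds are assumptions made for illustration.

```python
import numpy as np

def topological_check(nodes, seed_point, max_caliber=14.0, r_range=(0.80, 0.95),
                      min_right_fraction=0.7):
    """Flag snake nodes that ended up too far from the seed and correct them
    with a linear regression fitted on the correctly placed nodes.

    nodes: (N, 2) array of (x, y) positions of one side of the snake.
    """
    d = np.hypot(nodes[:, 0] - seed_point[0], nodes[:, 1] - seed_point[1])
    wrong = d > max_caliber                      # one of the possible indexes
    right = ~wrong
    if right.mean() < min_right_fraction:
        return nodes, False                      # not enough right nodes to correct
    # least-squares line y = a*x + b through the right nodes
    a, b = np.polyfit(nodes[right, 0], nodes[right, 1], 1)
    r = np.corrcoef(nodes[right, 0], nodes[right, 1])[0, 1]
    if not (r_range[0] <= abs(r) <= r_range[1]):
        return nodes, False                      # regression not accurate enough
    corrected = nodes.copy()
    corrected[wrong, 1] = a * corrected[wrong, 0] + b
    return corrected, True
```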
4 Results and Conclusions
To evaluate the system, 100 different retinal angiography images were taken from three primary attention centres of the Complejo Hospitalario Universitario de Santiago (CHUS). All the images had a resolution of 1024x1024 pixels and 256 grey levels. In this setting, the caliber, angle and iteration thresholds were set to 14 pixels, 0.081 rad and 4 loops, respectively. The evaluation was carried out by several ophthalmologists, who selected the main regions of interest in the images and decided whether the obtained results were correct. Figure 4 shows some examples of the results. To carry out the analysis, two different studies were performed: one to evaluate the detection subsystem and the other to evaluate the measurement subsystem. The vessel location model was tested on many search regions in the set of 100 images (more than 1,300 vessels to locate). The system was able to detect 1,183 of them, i.e. 91% of the total. Incorrect or irrelevant vessel locations have been reduced to 1%, improving the results obtained by most of the systems which deal with the same problem [7,8]. The measurement system was tested by comparing its results with the measurements made manually by the ophthalmologists. According to these comparisons, our model produces good estimates in
Fig. 4. (a) Results obtained by the system. (b) Statistical results.
95% of the cases. The results are even better after applying the topological control step: two out of three imprecise measurements are automatically rejected. This implies that more than 98% of the accepted vessel diameters are correct. In conclusion, we have described a new methodology combining two robust schemes to calculate the retinal AV index. On the one hand, the scheme for detecting vessels using crease tracking has achieved very good results, extracting only the representative structures (noise is fully discarded in most cases) and proving very suitable for the location of arteries and veins. On the other hand, the measurement model incorporates the available domain knowledge for locating the vessel edges. These measurements have been examined by several ophthalmologists, who concluded that the model is very reliable for computing the AV index and a very suitable tool for on-line assistance. Acknowledgements. This paper has been partly funded by the Xunta de Galicia and the Ministerio de Ciencia y Tecnología through the grant contracts PGIDIT03TIC10503PR and TIC2003-04649-C02-01, respectively.
References
1. E. Aurell, A. Kagan and T. Tibblin. A note of signs in the fundus oculi and arterial hypertension: conventional assessment and significance. Bull. World Health Organ., 34:955-960, 1967.
2. A. V. Stanton, F. Mee and col. A method of quantifying retinal microvascular alterations associated with blood pressure and age. J. Hypertens., 13:41-48, 1995.
3. A. M. L. Peña. Multilocal Methods for Ridge and Valley Delineation in Image Analysis. PhD thesis, Universitat Autónoma de Barcelona, 2000.
4. A. López, D. Lloret, J. Serrat and col. Multilocal creaseness based on the level set extrinsic curvature. Computer Vision and Image Understanding, 77:111-114, 2000.
5. M. Kass, A. Witkin, and D. Terzopoulos. Active contour models. International Journal of Computer Vision, 1(2):321-331, 1988.
6. J. Canny. A computational approach to edge detection. IEEE Trans. on Pattern Analysis and Machine Intelligence, 8(6):679-698, 1986.
7. B. Kiss, K. Polak, G. Dorner and col. Evaluation of the Zeiss retinal vessel analyser. British Journal of Ophthalmology, 84:1285-1290, 2000.
8. V. Leborán, A. Mosquera and col. Art-vena: Medida del calibre vascular retiniano [measurement of the retinal vascular caliber]. Technical report, Department of Electronics and Computation, USC, 2001.
Image Registration in Electron Microscopy. A Stochastic Optimization Approach
J.L. Redondo, P.M. Ortigosa, I. García, and J.J. Fernández
Dpt. Computer Architecture and Electronics, University of Almería, 04120 Almería, Spain
[email protected]
Abstract. Electron microscope tomography allows determination of the 3D structure of biological specimens, which is critical to understanding their function. Prior to the 3D reconstruction procedure, the images taken from the microscope have to be properly registered. Traditional alignment methods in this field are based on a phase residual function that is minimized by inefficient exhaustive search procedures. This work addresses this minimization problem from a global optimization perspective. A stochastic multimodal optimization algorithm has been applied and evaluated for the task of image registration in this field. This algorithm has turned out to be a promising alternative to the standard methodology. The alignments found show high levels of accuracy, while reducing the number of function evaluations by a significant factor with respect to the standard method.
1 Introduction
Electron microscope (EM) tomography allows the investigation of the structure of specimens over different levels of detail, from the subcellular domain up to atomic resolution [1,2]. Structural information is essential for the interpretation of biological function. In EM tomography, a set of images taken from the specimen at different tilts and orientations is combined by 3D reconstruction methods to yield the structure. Since the acquired EM images are not mutually aligned, a registration procedure [3,4,5] has to be applied prior to the reconstruction process. The standard alignment method in this field is based on the minimization of an error function by means of inefficient exhaustive search procedures [6]. In this work, the problem of image registration is addressed in a global optimization framework. The error function involved in the alignment is used as an objective function to be minimized by stochastic optimization. This is a multimodal optimization problem, as the number of local minima may be large. In this work, a stochastic multimodal optimization algorithm, the Universal Evolutionary Global Optimizer (UEGO, [7]), is evaluated. In multimodal optimization, the optimizer must be able to find the global optimum in the presence of many deceptive optima. Therefore, time should be spent in discovering new and promising regions rather than exploring the same region multiple times. UEGO uses a non-overlapping set of clusters which define
sub-domains of the search space. As the algorithm progresses, the search process can be directed towards smaller regions by creating new sets of non-overlapping clusters defining smaller sub-domains. This process is a kind of cooling method similar to simulated annealing. A particular cluster is not a fixed part of the search domain; it can move through the space as the search proceeds. The non-overlapping property of the set of clusters is maintained, however. UEGO is abstract in the sense that the 'cluster management' and the cooling mechanism have been logically separated from the actual optimization algorithm. Therefore it is possible to plug in any kind of optimizer to work inside a cluster. In this work, a stochastic hill climber [8] has been used.
2 Description of the Optimization Algorithm: UEGO
A key notion in UEGO is that of a species. A species would be equivalent to an individual in a usual evolutionary algorithm. A species can be thought of as a window on the whole search space (Figure 1). This window is defined by its center and a positive radius. The center represents a solution.
Fig. 1. Concept of species
This definition assumes a distance defined over the search space. The role of this window is to 'localize' the optimizer, which is always called by a species and can 'see' only its own window, so every new sample is taken from there. This means that any single step made by the optimizer in a given species is no larger than the radius of the given species. If the value of a new solution is better than that of the old center, the new solution becomes the center and the window is moved while keeping the same radius value. The radius of a species is not arbitrary; it is taken from a list of decreasing radii, the radius list, which follows a cooling schedule. The first element of this list is the diameter of the search space. If the radius of a species is the ith element of the list, then the level of the species is said to be i. Given the smallest radius (associated with the last level) and the largest one (the diameter of the search space), the radii in the list follow an exponential (geometric) progression between these two values.
The parameter levels indicates the maximal number of levels in the algorithm, i.e. the number of different 'cooling' stages. Every level i (with 1 <= i <= levels) has a radius value and two maxima on the number of function evaluations (f.e.): the maximum f.e. allowed when creating new species and the maximum f.e. allowed when optimizing individual species. During the optimization process, a list of species is kept by UEGO. This concept, the species list, would be equivalent to the term population in an evolutionary algorithm. UEGO is in fact a method for managing this species list (i.e. creating, deleting and optimizing species). The maximal length of the species list is given by max_spec_num (maximum population size).
Input parameters. In UEGO the most important parameters are those defined at each level: the radii and the maximum numbers of function evaluations for species creation and optimization. These parameters are computed from some user-given parameters that are easier to understand:
evals (N): the maximal number of function evaluations for the whole optimization process. Note that the actual number of function evaluations is usually less than this value.
levels: the maximum number of levels (i.e. cooling stages).
max_spec_num (M): the maximum length of the species list, or the maximum allowed population size.
min_r: the radius associated with the maximum level, i.e. levels.
The reader is referred to [7] for a detailed description of these input parameters and their relationship to the parameters at each level.
The Algorithm. The UEGO algorithm iterates species creation, fusion, list shortening and species optimization over the successive levels.
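A minimal, self-contained sketch of this loop is given below, consistent with the stage descriptions that follow; the function structure, the geometric radius schedule, the per-level budget split and the simplified fusion/shortening rules are illustrative assumptions, not the authors' exact procedure.

```python
import random, math

def uego(f, bounds, levels=5, max_spec=10, budget=20000, min_r=1e-2):
    """Sketch of a UEGO-style multimodal minimizer.

    f: objective to minimize; bounds: list of (lo, hi) intervals, one per dimension.
    Species are [center, value] pairs; radii follow a geometric cooling schedule.
    """
    diam = math.dist([lo for lo, _ in bounds], [hi for _, hi in bounds])
    radii = [diam * (min_r / diam) ** (i / (levels - 1)) for i in range(levels)]

    def clip(p):
        return [min(max(x, lo), hi) for x, (lo, hi) in zip(p, bounds)]

    def sample_in(center, r):
        return clip([c + random.uniform(-r, r) for c in center])

    def hill_climb(center, value, r, n):
        for _ in range(n):                      # simple stochastic local search
            cand = sample_in(center, r)
            fc = f(cand)
            if fc < value:
                center, value = cand, fc
        return center, value

    start = [random.uniform(lo, hi) for lo, hi in bounds]
    species = [[start, f(start)]]               # one random species at level 1
    per_level = budget // levels
    for lvl in range(levels):
        # creation: keep pairs of trial points whose midpoint is worse (different hills)
        new = []
        for c, _ in list(species):
            for _ in range(per_level // (4 * len(species))):
                a, b = sample_in(c, radii[lvl]), sample_in(c, radii[lvl])
                fa, fb = f(a), f(b)
                if f([(x + y) / 2 for x, y in zip(a, b)]) > max(fa, fb):
                    new += [[a, fa], [b, fb]]
        species += new
        # fusion: merge species whose centers are closer than the current radius
        fused = []
        for s in sorted(species, key=lambda sp: sp[1]):
            if all(math.dist(s[0], t[0]) > radii[lvl] for t in fused):
                fused.append(s)
        species = fused[:max_spec]               # shortening of the species list
        # optimization of every surviving species inside its current window
        for s in species:
            s[0], s[1] = hill_climb(s[0], s[1], radii[lvl], per_level // (2 * max_spec))
    return min(species, key=lambda sp: sp[1])
```

For example, uego(lambda p: (p[0] - 1) ** 2 + p[1] ** 2, [(-5, 5), (-5, 5)]) should return a species whose center lies close to (1, 0).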
In the following, the different key stages in the algorithm are described:
Init_species_list: A new species list consisting of one species with a random center at level 1 is created. Create_species(evals): For every species in the list, random trial points in the 'window' of the species are created, and for every pair of trial points the objective function is evaluated at the middle of the section connecting the pair (see Figure 2). If the value at the middle point is worse than the values of the pair, then the members of the pair are inserted in the species list. Every newly inserted species is assigned the current level value (i).
Fig. 2. Creation procedure
As a result of this procedure the species list will eventually contain several species with different levels (hence different radii). The motivation behind this method is to create species that are on different ‘hills’ so ensuring that there is a valley between the new species. The parameter of this procedure (evals) is an upper bound of the number of function evaluations. Note that this algorithm needs a definition of section in the search space. In terms of genetic algorithms, it could be thought that, in this procedure, a single parent (species) is used to generate offspring (new species), and all parents are involved in the procedure of generating offspring. Fuse_species(radius): If the centers of any pair of species from the species list are closer to each other than the given radius, the two species are fused (see Figure 3). The center of the new species will be the one with the better function value while the level will be the minimum of the levels of the original species (so the radius will be the largest one).
Fig. 3. Fusion procedure
Shorten_species_list(max_spec_num): This deletes species to reduce the list length to the given value. Higher-level species are deleted first, therefore species with larger radii are always kept. For this reason, one species at level 1, whose radius is equal to the diameter of the search domain, always exists, making it possible to escape from local optima. Optimize_species(budget_per_species): This executes the optimizer (in this paper, SASS) for every species with a given number of evaluations (budget_per_species) (see Figure 1). At level i, the budget per species is computed from the maximum number of evaluations allowed for optimization at that level and the maximum species number (maximum population size). Note that UEGO may terminate simply because it has executed all of its levels. The final number of function evaluations thus depends on the complexity of the objective function. This behavior is qualitatively different from that of genetic algorithms, which typically run up to a limit on the number of function evaluations.
3 Image Registration in Electron Microscope Tomography
The combination of information from different EM images taken with the specimen at different orientations is central to 3D reconstruction [1]. These starting EM images are not, in general, mutually aligned after acquisition. Formulated in the Fourier domain, this means that the initial images have arbitrary phase origins. An essential step before the reconstruction is a proper registration or alignment so that the images have a common phase origin, i.e. they are oriented consistently. Only then can the 3D reconstruction process be performed to derive the structure of the specimen under study. In practice, EM images are severely corrupted by noise, deformations and other measurement errors, which turn the alignment into an optimization problem. In this work, we have focused on electron crystallography, a unique approach in EM that is capable of reaching atomic resolution by using crystallized specimens [9]. Here, the images are processed and combined using a discrete number of Fourier components extracted from them [10]. In this approach, the need for image registration is limited to translational image alignment.
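The spatial shift property underlying the phase residual can be checked numerically; the sketch below (NumPy assumed) shifts an image circularly and verifies that the phase of each Fourier component changes by a linear ramp in the shift.

```python
import numpy as np

rng = np.random.default_rng(0)
img = rng.random((64, 64))
dy, dx = 5, 11                                   # integer shift in pixels
shifted = np.roll(np.roll(img, dy, axis=0), dx, axis=1)

F, G = np.fft.fft2(img), np.fft.fft2(shifted)
ky = np.fft.fftfreq(img.shape[0])[:, None]       # cycles per pixel along y
kx = np.fft.fftfreq(img.shape[1])[None, :]
# predicted phase change of each component: -2*pi*(ky*dy + kx*dx)
predicted = -2 * np.pi * (ky * dy + kx * dx)
measured = np.angle(G) - np.angle(F)
err = np.angle(np.exp(1j * (measured - predicted)))   # compare modulo 2*pi
print(np.abs(err).max())                          # close to zero: the shift property holds
```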
Traditionally, in this field alignment has been performed by minimizing a phase residual function which is directly derived from the spatial shift property of the Fourier transform [6], of the form

$$PR(\Delta\varphi_x, \Delta\varphi_y) = \sqrt{\frac{1}{N}\sum_{(h,k)}\left[\varphi^{ref}_{h,k} - \varphi_{h,k} - \left(h\,\Delta\varphi_x + k\,\Delta\varphi_y\right)\right]^2},$$

where $\Delta\varphi_x$ and $\Delta\varphi_y$ are the phase shifts (in degrees) along the X and Y axes, respectively; $\varphi^{ref}_{h,k}$ and $\varphi_{h,k}$ are the phases of the $(h,k)$ frequency component of the reference image(s) and of the image to be aligned, respectively; and $N$ is the total number of frequency components involved in the comparison. The alignment of a new image with respect to a reference image (or a set of images) essentially consists of determining the global minimum of the phase residual function. This minimization is traditionally carried out by means of an inefficient exhaustive search procedure that evaluates all the possible shifts in a discrete search space. The alignment of the new image is then accomplished by applying the phase shift with the lowest phase residual. Fig. 4 shows an example of alignment with images of the connector [11] obtained by electron microscopy of two-dimensional crystals. The reference image is shown in Fig. 4(a) and an image to be aligned in Fig. 4(b). The phase residual map in Fig. 4(c) shows the PR function with 361 × 361 samples at intervals of 1° in either direction X or Y; the upper left and lower right corners of the map correspond to shifts of (–180°, –180°) and (+180°, +180°), respectively. White levels in the error map represent low PR values, whereas black levels denote high PR values. The search for the global minimum was performed with a precision of 0.1°, and the minimum was found at (162.5°, –113.1°). Fig. 4(d) shows the average image resulting from the merging of the aligned image and the reference.
Fig. 4. Traditional alignment method. (a) Reference image. (b) Image to be aligned. (c) Phase residual map. (d) Average image resulting from the alignment of image in (b) and merging with the image in (a).
In practice, the standard alignment method carries out this exhaustive search process hierarchically. First, the whole search space (360° in X and 360° in Y) is
discretized at intervals of 3° and the minimum is then sought. Afterwards, the search is progressively refined around that minimum with decreasing resolutions (1.0°, 0.1°, etc). At the end, this approach is able to yield the global optimum with a precision of 0.01° using 43200 evaluations. This heuristic multi-level search approach works because the phase residual function does not exhibit extremely sharp local minima.
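A sketch of this coarse-to-fine exhaustive search is given below; the phase_residual helper implements the RMS form given above, and its exact weighting as well as the refinement window of 1.5 previous steps are assumptions rather than the standard method's exact rule.

```python
import numpy as np

def phase_residual(phi_ref, phi, h, k, dx, dy):
    """RMS phase residual (degrees) after applying the shift (dx, dy) in degrees.
    phi_ref, phi, h, k: 1D arrays describing the N Fourier components used."""
    diff = phi_ref - phi - (h * dx + k * dy)
    diff = (diff + 180.0) % 360.0 - 180.0          # wrap to (-180, 180]
    return np.sqrt(np.mean(diff ** 2))

def hierarchical_align(phi_ref, phi, h, k, steps=(3.0, 1.0, 0.1, 0.01)):
    """Coarse-to-fine exhaustive minimization of the phase residual."""
    cx = cy = 0.0
    half = 180.0                                    # first pass covers the whole space
    best = (np.inf, cx, cy)
    for step in steps:
        xs = np.arange(cx - half, cx + half + step / 2, step)
        ys = np.arange(cy - half, cy + half + step / 2, step)
        best = (np.inf, cx, cy)
        for dx in xs:
            for dy in ys:
                pr = phase_residual(phi_ref, phi, h, k, dx, dy)
                if pr < best[0]:
                    best = (pr, dx, dy)
        _, cx, cy = best
        half = 1.5 * step                           # refine around the current minimum
    return cx, cy, best[0]
```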
4 Evaluation of UEGO for Image Registration in EM
In this section experimental results on EM images are presented. We tested and evaluated the proposed approach on five different EM images of connectors obtained from untilted two-dimensional crystals [11]. Due to the stochastic nature of UEGO, all the numerical results given in this work are average values over one hundred executions, yielding a statistical ensemble of experiments. From this data set, average values of the different metrics (number of function evaluations, phase shifts, phase residual, etc.) and the corresponding 95% confidence intervals were computed. The parameters of UEGO were tuned for the registration problem. A robust parameter setting was found according to the guidelines already stated [7]. The optimal values consisted of a maximum number of evaluations N = 300000, M = 30 clusters, and suitable values for the number of levels and for the radius at the maximum level. The assessment of UEGO was based on all possible alignments between pairs of images from the set of five experimental EM images, which resulted in 20 tests. For comparison's sake, both methods, UEGO and the traditional one, were evaluated in terms of number of function evaluations, shifts and phase residual. Average results and confidence intervals for UEGO were computed considering the 20 tests and 100 executions per test. The number of evaluations resulted in 24930 ± 217. The heuristic rule for the standard method was used, resulting in 43200 evaluations in all the alignments. The differences in phase residual between the results of both methods were negligible. Accordingly, the phase origins found by both proved to be very close, with differences in the order of 0.001°. Table 1 summarizes the results obtained with UEGO. The first row of the table indicates the index of the image that was used as a reference in the tests. The corresponding column then presents the results of the alignment of the remaining images with respect to the reference. The results consist of the number of function evaluations that was finally required ('FE') and the phase residual at the global minimum ('PR'). Note that the alignment of two images should yield the same results, whichever image is used as the reference. However, as UEGO is a stochastic algorithm, the number of function evaluations may not be exactly the same. That can be observed in Table 1: for instance, see the alignment of image #1 with respect to image #2, and that of image #2 with respect to image #1. The phase residual at the global minimum is 32.32° in both cases, but the number of evaluations is slightly different. The traditional strategy in EM to assess the quality of an image [10] consists of comparing an average image to all the images taking part in the averaging.
The average projection image has a better signal-to-noise ratio than the individual images, hence it represents a more reliable projection of the specimen. In order to further compare both alignment methods, this strategy was applied to assess the alignment of the images. An image was selected and centered according to symmetry criteria [11] (see Figure 4(a)). The remaining images were then aligned with respect to it and averaged all together. The registered images were then compared to the average projection in terms of phase residual. This scheme was applied for both alignment methods, and both yielded the same results.
5 Conclusions
In this work the application of UEGO to translational image alignment in electron microscopy has been evaluated and compared with the standard methodology. The results allow us to draw the conclusion that UEGO is an efficient optimizer for this problem, being able to reduce the number of evaluations by around 45% with respect to the traditional method. The values and the precision of the phase origins found by UEGO are essentially the same as those obtained by the standard method, hence the differences in phase residual are negligible. Moreover, the standard strategy used to assess the quality of the images also confirms that the alignments performed by both methods have the same quality. Therefore, we can conclude that UEGO exhibits the same performance as the standard method in terms of phase residual. As far as the number of evaluations and the computation time are concerned, UEGO clearly outperforms the traditional method. This proves to be a significant advantage in electron microscope tomography, since several hundreds of images may be involved in a 3D reconstruction. Therefore, there is a substantial
computation time spent in mutual image alignment, and UEGO could then help to considerably reduce this burden. Acknowledgments. The authors would like to thank Dr. J.M. Valpuesta for kindly providing the EM images of connectors. This work is supported by the Spanish CICYT (grant TIC2002-00228) and Fundación BBVA (grant BIO-043).
References
1. Baumeister, W., Steven, A.: Macromolecular electron microscopy in the era of structural genomics. Trends Biochem. Sci. 25 (2000) 624–631
2. Sali, A., Glaeser, R., Earnest, T., Baumeister, W.: From words to literature in structural proteomics. Nature 422 (2003) 216–225
3. Brown, L.: A survey of image registration techniques. ACM Comp. Surveys 24 (1992) 325–376
4. Maintz, J., Viergever, M.: A survey of medical image registration. Medical Image Analysis 2 (1998) 1–36
5. Zitova, B., Flusser, J.: Image registration methods: a survey. Image and Vision Computing 21 (2003) 977–1000
6. Grant, R., Schmid, M., Chiu, W., Deatherage, J., Hosoda, J.: Alignment and merging of EM images of frozen hydrated crystals. Biophys. J. 49 (1986) 251–258
7. Ortigosa, P., García, I., Jelasity, M.: Reliability and performance of UEGO, a clustering-based global optimizer. J. Global Optim. 19 (2001) 265–289
8. Solis, F., Wets, R.: Minimization by random search techniques. Mathematics of Operations Research 6 (1981) 19–30
9. Nogales, E., Wolf, S., Downing, K.: Structure of the alpha beta tubulin dimer by electron crystallography. Nature 391 (1998) 199–203
10. Walz, T., Grigorieff, N.: Electron crystallography of 2D crystals of membrane proteins. J. Struct. Biol. 21 (1998) 142–161
11. Valpuesta, J., Fernandez, J., Carazo, J., Carrascosa, J.: The 3D structure of a DNA translocating machine at 10 Å resolution. Structure 7 (1999) 289–296
Evolutionary Active Contours for Muscle Recognition
A. Caro¹, P.G. Rodríguez¹, M.L. Durán¹, J.A. Ávila¹, T. Antequera², and R. Palacios³
¹ University of Extremadura, Dept. Informática, Escuela Politécnica, Av. Universidad s/n, 10071 Cáceres, Spain. {andresc, pablogr, mlduran}@unex.es
² University of Extremadura, Tecnología de los Alimentos, Facultad Veterinaria, Av. Universidad s/n, 10071 Cáceres, Spain. [email protected]
³ Hospital Universitario "Infanta Cristina", Servicio de Radiología, Badajoz, Spain
Abstract. Active Contours constitute a widely used Pattern Recognition technique, consisting of curves that are moved by minimizing an energy function. Classical active contours are based on different minimization methodologies. This paper proposes a new procedure that uses a Genetic Algorithm to minimize the energy function of the Active Contour. Both techniques are combined to produce a new methodology, called Evolutionary Active Contours. An application has been developed in order to validate the proposed method. The performance of this approach has been demonstrated with both synthetic and real Magnetic Resonance images. Moreover, the algorithm has been used to recognize muscles in Magnetic Resonance images of Iberian ham at different maturation stages in order to calculate their volume change. Our main findings are twofold: the feasible combination of genetic algorithms and active contours, and the automatic determination of optimal ripening points for Iberian ham samples.
1 Introduction Over the past few decades, a large number of techniques based on principles of evolution and heredity have been developed. Such systems maintain a population of potential solutions, have some type of selection process based on the fitness of individuals, and use some genetic operators. Some of the best known evolutionary methods are Evolutionary Programming [8], Genetic Algorithms [9], and Genetic Programming [12]. There are several important optimization problems for which such algorithms have proved effective. In fact, almost any problem may be tackled by genetic algorithms, since it can be perceived as a search through a space of potential solutions, and searching for the best solution can be considered an optimization process. Thus, these methods can be used in conjunction with other methodologies to find optimal solutions. Genetic algorithms have been used to solve Computer Vision tasks, particularly in optimization processes.
Active Contours (or Snakes) have been used in a wide array of applications over the past few years. Energy-minimizing active contour models were proposed by Kass et al. [10]. They consist of parameterized curves that can be moved under the influence of internal and external forces (an energy function), which are computed using a variational calculus approach to identify an appropriate solution. Dynamic programming has been used for minimizing the energy functional [1]; in addition, new models based on greedy algorithms [16] and gradient flows [7, 11, 17] have also been proposed. Energy minimization can also be achieved using evolutionary methodologies, by combining Active Contours with Genetic Algorithms. A genetic algorithm (evolutionary active contour) has been developed; it deforms the active contour for pattern recognition using adapted versions of the classical genetic operators (mutation and crossover). The evolutionary active contours have been tested on a data set of synthetic images. Furthermore, Iberian ham Magnetic Resonance images (MRI) have been processed in order to reach conclusions on ham products. MRI provides great capabilities for noninvasive examination of the organs, as it produces multiple digital images of the tissue. The ripening of Iberian ham constitutes a long process, in which physical-chemical and sensorial methods are required to evaluate different parameters. These techniques are tedious, destructive and expensive [2]. Traditionally, the maturation time is fixed when the weight loss of the ham is approximately 30% [6]. Hence, other methodologies have long been awaited by the Iberian ham industry. Computer Vision techniques that study volume reductions [4, 5] may embody one of those solutions. As a result, the evolutionary active contour approach is described in the next section. The practical application of this method (section 3) to determine the volume changes during the ripening process of the Iberian ham, by processing MR images, is the second contribution of this work.
2 Methodology Evolutionary active contour techniques solve the optimization problem of active contours by using genetic algorithms. In this section, a review of these methodologies, as well as their conjunction, is presented.
2.1 Genetic Algorithms Genetic Algorithms maintain a population of individuals for each iteration [14]. Each individual represents a potential solution to the problem, and is evaluated to give some measure of its fitness. Then, a new population is formed by selecting the fittest individuals (selection step). Some members of the new population undergo transformations (alteration step) by means of genetic operators to form new solutions. There are unary transformations (mutation type), which create new individuals by a small change in a single individual. Likewise, there are higher order transformations (cross-
over type), which create new individuals by combining parts from several (two or more) individuals. After some generations, the program converges. It is hoped that the best individual represents a reasonable solution. The structure of a genetic algorithm is shown in Figure 1.
Fig. 1. Structure of a genetic algorithm
The initial population, either generated randomly or created as a result of some heuristic process, is the starting point for the genetic algorithm. The evaluation function returns the fitness of the population, distinguishing better from worse individuals. Several mutation operators can be designed to transform the population, and a few crossover operators, combining the structure of two (or more) individuals into one, may be considered.
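The generic loop just described (and sketched in Figure 1) can be written as follows; the operator signatures and the elitist half-population selection are illustrative assumptions.

```python
import random

def genetic_algorithm(init_population, fitness, crossover, mutate,
                      generations=100, mutation_rate=0.1):
    """Generic GA loop: evaluate, select the fittest, then alter by crossover/mutation."""
    population = list(init_population)
    for _ in range(generations):
        scored = sorted(population, key=fitness)           # lower fitness = better
        survivors = scored[: max(2, len(scored) // 2)]     # selection step
        offspring = []
        while len(survivors) + len(offspring) < len(population):
            a, b = random.sample(survivors, 2)
            child = crossover(a, b)                        # alteration: crossover
            if random.random() < mutation_rate:
                child = mutate(child)                      # alteration: mutation
            offspring.append(child)
        population = survivors + offspring
    return min(population, key=fitness)
```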
2.2 Active Contours Deformable Models (Active Contours, or Snakes) are curves that can be moved under the influence of internal and external forces [3]. The forces are defined so that the snake can detect the image objects of interest. Active Contours are defined by an energy function; by minimizing this energy function, the contour converges and the solution is achieved. An Active Contour is represented by a vector, v, which contains all of the n points of the snake. The functional energy of this snake is given by

$$E(v) = \sum_{i=1}^{n}\left[\alpha\, E_{cont}(v_i) + \beta\, E_{curv}(v_i) + \gamma\, E_{image}(v_i)\right],$$

where $E_{cont}(v_i) + E_{curv}(v_i)$ is the internal energy of the contour, consisting of a continuity energy plus a curvature energy, and $E_{image}(v_i)$ represents the proper energy of the image, which is quite different from one image to another. $\alpha$, $\beta$ and $\gamma$ are values chosen to control the influence of the three terms [13, 15, 17].
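A minimal sketch of this energy evaluation is given below; the particular discretization of the continuity, curvature and image terms follows the classical greedy-snake formulation and is an assumption about the exact form used here, as are the weight values.

```python
import numpy as np

def snake_energy(points, grad_mag, alpha=1.0, beta=1.0, gamma=1.2):
    """Discrete snake energy (lower is better) for a closed contour.

    points: (N, 2) array of (x, y) node coordinates.
    grad_mag: 2D array with the gradient magnitude of the image.
    """
    prev = np.roll(points, 1, axis=0)
    nxt = np.roll(points, -1, axis=0)
    # continuity: deviation of each segment length from the mean spacing
    seg = np.linalg.norm(points - prev, axis=1)
    e_cont = (seg - seg.mean()) ** 2
    # curvature: squared norm of the discrete second derivative
    e_curv = np.linalg.norm(prev - 2 * points + nxt, axis=1) ** 2
    # image energy: negative gradient magnitude, so edges attract the contour
    xi = np.clip(points[:, 0].round().astype(int), 0, grad_mag.shape[1] - 1)
    yi = np.clip(points[:, 1].round().astype(int), 0, grad_mag.shape[0] - 1)
    e_img = -grad_mag[yi, xi]
    return float(np.sum(alpha * e_cont + beta * e_curv + gamma * e_img))
```

In the evolutionary scheme of the next subsection, this quantity (evaluated over each point and its neighborhood) plays the role of the fitness function.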
2.3 Evolutionary Active Contours By combining the two techniques above, a new approach has been designed. This method, called Evolutionary Active Contours, consists in using genetic algorithms to minimize the energy function of the active contours. The structure of the evolutionary active contour algorithm is exactly the same as that shown in Figure 1. The active contour, represented by the vector v, contains all the n points of the snake. These n points (the whole vector v) constitute the population of the genetic algorithm (each point is an individual), and this vector forms the initial population of the genetic algorithm (randomly generated). Some points of the snake (individuals of the population) are better than others. The evaluation of these potential solutions is achieved by computing the energy function of the active contour. The fitness function is the snake energy function, which is applied over all the points of the snake and their respective neighborhoods. In this way the new population is formed, i.e., by selecting the best individuals ('select' step). The individuals of the new population are modified ('alter' step) by the mutation and crossover operators. As the snake points are coded as (x,y) coordinates, all these points (i.e., the population) are reordered at each generation to prevent the snake from crossing itself. Crossover combines the features of two parent individuals to form two similar offspring by swapping corresponding segments of the parents, exchanging information between different potential solutions. Three different crossover operators have been developed in the evolutionary active contour algorithm. Two of them exchange the X or Y coordinates of the individuals, selecting consecutive individuals in the vector v (crossover 1) or alternate individuals in the vector v separated by the same given distance (crossover 2), as Figure 2 shows.
Fig. 2. Crossover by swapping the X coordinate
Figure 3 shows the developed crossover 3, in which the coordinates (X or Y) of the individuals are converted into binary strings and segments of these strings are swapped. By changing only the least significant bits (the bits on the right of the string), new individuals are placed near their parents. The mutation operator alters the individuals, introducing extra variability into the population. Two different mutation operators have been developed in the evolutionary active contour algorithm. The first one moves the selected individual to another position in a square neighborhood, as Figure 4 illustrates. The second operator works with the binary coordinates, selecting a position in the string and swapping its value, as may be seen in Figure 5. Again, all the changes are made over the least
significant bits (bits on the right of the string) in order to produce new individuals located near parents.
Fig. 3. Crossover by swapping the X coordinate, considering individuals as binary string
Fig. 4. Mutation of one individual, from the initial position (center of the square) to another in the square neighborhood. One of the eight possible positions is selected as the new individual. (a) Considers a 3x3 neighborhood and (b) considers a 5x5 neighborhood
Fig. 5. Mutation of one individual, converted into binary string
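The coordinate-swapping crossover (crossover 1) and the bit-level mutation described above can be sketched as follows; the 9-bit coordinate width and the restriction of changes to the two least significant bits are illustrative assumptions.

```python
import random

def crossover_swap_x(snake, i):
    """Crossover 1: exchange the X coordinate of two consecutive snake points."""
    child = [list(p) for p in snake]
    j = (i + 1) % len(child)
    child[i][0], child[j][0] = child[j][0], child[i][0]
    return child

def mutate_low_bits(point, bits=9, n_low=2):
    """Bit-level mutation: flip one of the least significant bits of a coordinate,
    so the mutated individual stays close to its parent."""
    x, y = point
    coord, idx = random.choice([(x, 0), (y, 1)])
    flipped = int(coord) ^ (1 << random.randrange(n_low))   # flip a low-order bit
    new_point = [x, y]
    new_point[idx] = min(flipped, 2 ** bits - 1)
    return tuple(new_point)
```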
3 Results and Discussion Our core research is based on a data set of synthetic images as well as MRI sequences of Iberian ham images. Fifty synthetic images were designed to test the validity of the evolutionary active contour technique. Six Iberian hams were scanned at three different stages during the ripening time. The MRI volume data set was obtained from sequences of T1 images with a 120x85 mm FOV (field of view) and a 2 mm slice thickness, i.e. a voxel resolution of 0.23x0.20x2 mm. The total number of
real MR images in the database is 252. Figure 6 shows examples of some synthetic images (with the final snake), while Figure 7 contains real MR images with the final snake.
Fig. 6. Example of synthetic images used in the experiments, including the final snakes.
Once the evolutionary algorithm was validated on the data set of synthetic images, it was used to study the evolution of the Iberian ham muscles during the ripening process. The objective was to study how the hams evolve over the maturation process. Our main findings indicate that the evolutionary active contours work as a suitable solution for pattern identification (muscles, in the proposed practical application). The biceps femoris muscle has been satisfactorily recognized, matching the images of the database. On average, the error made has been estimated at less than 10%, comparing the manual expert delineation of the muscles with the final area of the snake-segmented muscle.
Fig. 7. Illustration of three Iberian ham MR images, which include the detection of the muscles
The practical application of the proposed algorithm shows the volume reduction of the Iberian ham biceps femoris during its ripening stages. The results presented in Figure 8 (obtained with the evolutionary active contours) demonstrate an average size reduction of up to 30% from the initial stage (raw) to the second phase (semi-dry). When comparing the semi-dry and dry-cured stages, the average decrease is approximately 20%. The average reduction is about 50% at the end of the maturation process, 21 months after the initial stage. It is extremely significant to compare the computer vision results (a reduction by 50%) with the traditional methods. Food Technology specialists have estimated the total weight decrease of the Iberian ham to be 30% during the same time. Thus, a relationship between the ham weight decrease (30%) and the muscle reduction (50%)
is established in our approach: the weight decrease may be caused by the loss of water during the maturation time. The optimal ripening time may not be the same for different Iberian hams. In addition, by studying the percentage rate of volume reduction during the ripening process, it was possible to predict the optimal maturation point.
Fig. 8. Biceps femoris size reduction during ripening time in Iberian hams. Muscle size has been determined by applying evolutionary active contours
4 Conclusions This paper has described a new approach for object boundary detection which combines genetic algorithms and active contour models. In addition, the robustness of the optimization process has been confirmed. The feasibility of the proposed algorithm has been demonstrated on synthetic images. Moreover, the practical viability of applying Computer Vision techniques, in conjunction with MRI, to automate the determination of the optimal ripening time of Iberian hams constitutes a key finding of our research. Such computer vision techniques may introduce new and alternative methods for future work which, together with traditional chemical processes, may serve to evaluate different physical properties that characterize the Iberian ham.
Acknowledgments. The authors wish to acknowledge and thank the support of the "Dehesa de Extremadura" brand name and the "Hermanos Alonso Roa" company from Villar del Rey (Badajoz). In addition, this research has been funded by the Junta de Extremadura (Regional Government Board) under projects IPR98A03P and 2PR01C025.
References
1. Amini, A.A., Weymouth, T.E. and Jain, R.: Using Dynamic Programming for Solving Variational Problems in Vision, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 12, 855-867, (1990)
2. Antequera, T., López-Bote, C.J., Córdoba, J.J., García, C., Asensio, M.A., Ventanas, J. and Díaz, Y.: Lipid oxidative changes in the processing of Iberian pig hams, Food Chem., 54, 105, (1992)
3. Blake, A. and Isard, M.: Active Contours. Springer, London, UK, (1998)
4. Caro, A., Rodríguez, P.G., Cernadas, E., Durán, M.L., Antequera, T.: Potential Fields as an External Force and Algorithmic Improvements in Deformable Models, Electronic Letters on Computer Vision and Image Analysis, Vol. 2(1), 23-34, (2003)
5. Caro, A., Rodríguez, P.G., Ávila, M., Rodríguez, F., Rodríguez, F.J.: Active Contours Using Watershed Segmentation, IEEE Int. Workshop on Systems, Signal and Image Processing, Manchester, UK, 340-345, (2002)
6. Cava, R. and Ventanas, J.: Dinámica y control del proceso de secado del jamón ibérico en condic. naturales y cámaras climatizadas, T. jamón ibérico, Mundi Prensa, 260-274, (2001)
7. Cohen, L.D.: On Active Contour Models and Balloons, Computer Vision, Graphics and Image Processing: Image Understanding, Vol. 53(2), 211-218, (1991)
8. Fogel, L.J., Owens, A.J. and Walsh, M.J.: Artificial Intelligence through Simulated Evolution, John Wiley, New York, (1966)
9. Holland, J.: Adaptation in Natural and Artificial Systems, Univ. of Michigan Press, (1975)
10. Kass, M., Witkin, A. and Terzopoulos, D.: Snakes: Active Contour Models, Proceedings of the First International Conference on Computer Vision, London, 259-269, (1987)
11. Kichenassamy, S., Kumar, A., Olver, P., Tannenbaum, A. and Yezzi, A.: Gradient Flows and Geometric Active Contour Models, Proc. Int. Conf. on Computer Vision, Cambridge, (1995)
12. Koza, J.: Genetic Programming: On the Programming of Computers by Means of Natural Selection. MIT Press, Cambridge, MA, (1992)
13. Larsen, O.V., Radeva, P. and Martí, E.: Guidelines for Choosing Optimal Parameters of Elasticity for Snakes, Proc. Int. Conf. Computer Analysis and Image Proces., 106-113, (1995)
14. Michalewicz, Z.: Genetic Algorithms + Data Structures = Evolution Programs. 3rd edn. Springer-Verlag, Berlin Heidelberg New York, (1996)
15. Ranganath, S.: Analysis of the Effects of Snake Parameters on Contour Extraction, Proc. Int. Conference on Automation, Robotics, and Computer Vision, 451-455, (1992)
16. Williams, D.J. and Shah, M.: A Fast Algorithm for Active Contours and Curvature Estimation, C. Vision, Graphics and Im. Proc.: Im. Understanding, Vol. 55, 14-26, (1992)
17. Xu, C. and Prince, J.L.: Gradient Vector Flow: A New External Force for Snakes, IEEE Proc. on Computer Vision and Pattern Recognition, (1997)
Automatic Lane and Band Detection in Images of Thin Layer Chromatography

António V. Sousa (1,2), Rui Aguiar (3), Ana Maria Mendonça (1,4), and Aurélio Campilho (1,4)

(1) Instituto de Engenharia Biomédica, Rua Roberto Frias, 4200-465 Porto, Portugal
(2) Instituto Superior de Engenharia do Porto, Rua Dr. António Bernardino de Almeida 431, 4200-072 Porto, Portugal
[email protected]
(3) Instituto de Biologia Molecular e Celular, Rua do Campo Alegre, 823, 4150-180 Porto, Portugal
(4) Faculdade de Engenharia da Universidade do Porto, Rua Roberto Frias, 4200-465 Porto, Portugal
{amendon, campilho}@fe.up.pt
Abstract. This work aims at developing an automatic method for the analysis of TLC images, measuring a set of features that can be used to characterize the distinctive patterns resulting from the separation of oligosaccharides contained in human urine. This paper describes the methods developed for the automatic detection of the lanes contained in TLC images and for the automatic separation of the bands of each detected lane. The extraction of quantitative information related with each band was accomplished with two methods: the expectation-maximization (EM) algorithm and a nonlinear least-squares trust-region algorithm. The results of these methods, as well as additional quantitative information related with each band, are also presented.
1 Introduction This work aims at developing an automatic method for the analysis of TLC images, measuring a set of features that can be used to characterize the distinctive patterns resulting from the separation of oligosaccharides contained in human urine. These data will form the training set for the classification of pathological or normal cases. The separation of materials based on TLC (Thin-Layer Chromatography) is used as a means for the diagnosis of lysosomal pathologies that can be identified by the presence of oligosaccharides in the patient’s urine [1-2]. The chromatographic separation process occurs in the thin interlayer of the TLC plate. The application of this technique produces a qualitative pattern of excretion of the above-mentioned compounds, evidenced by the presence of bands along the lane, each band
corresponding to a specific oligosaccharide. According to the observed pattern, and taking into account the relative intensities of the bands, it is normally possible to obtain evidence about the pathological state of the patient. The diversity of compounds present in human urine makes the separation and analysis of the oligosaccharides very complex. Recently, new methods have been proposed to overcome these difficulties [3-4]. However, TLC remains an instrumental method that still plays an important role today, in particular for the separation of substances with different polarities. TLC plate images usually contain several lanes; each lane can be used for the separation of a patient’s urine, or to decompose a specific oligosaccharide that will serve as a reference marker for that plate. Image processing techniques have already been used to address the problem of TLC image analysis. Machado et al. describe in [12] an iterative algorithm for the automatic segmentation of lanes. The proposed algorithm is based on the projection of the grey-level intensities onto the horizontal direction and on the periodic properties of the intensity profile. Recently, some companies have developed image analysis software that automates the detection of lanes and bands in this type of image [13-17]. This paper describes the methods developed for the automatic detection of the lanes and the automatic separation of the bands of each detected lane. The extraction of additional quantitative information related with each band is also presented. The next section describes the procedure for automatic lane and band detection. The feature extraction methods based on the EM (expectation-maximization) and nonlinear least-squares trust-region algorithms are explained in Section 3. Experimental results for the evaluation of the proposed method are presented in Section 4 and conclusions are drawn in the final section.
2 Automatic Lane and Band Detection The first processing task consists in detecting the lanes in a TLC plate image. Each lane is composed of a set of bands that will allow the identification of the class to which it belongs. The main objective is to extract a meaningful set of features from each lane. Each TLC plate contains up to a maximum of 9 lanes, of which at least one is considered a reference lane (the carbohydrate lactose is normally used as reference marker). The position of this marker defines the reference position for all the bands in the lanes. The processing flow involves three phases, described in the next subsections.
2.1 Image Acquisition and Pre-processing
TLC plates are digitized through light reflection with a scanner. The results are RGB images, represented with 24 bits per pixel. In these images, lanes have a vertical orientation, with a bottom-up direction of flow. The first processing operations are intended to automatically extract the image region of interest (ROI) that contains the lanes. First, the original RGB image is converted to
a greyscale intensity image. Then, with the use of differential filters sensitive to the vertical and horizontal directions, it is possible to locate the greatest transitions occurring at the image limits, allowing the detection of the four lines that surround the image area where the relevant information is contained. Before initiating the lane detection task, the greyscale image is filtered with a low-pass Gaussian filter, aiming at removing the background noise always present in this type of image.
2.2 Lane Detection
In order to detect the position of the lanes, the image intensity information is projected onto the horizontal direction. Although the maximum value in each row is suggested in [12] as a possible projection criterion, we found that the conventional projection based on the sum of intensities along each column is better, as the discrimination between large bands in neighbouring lanes is increased. This situation is shown in Figure 1, where the separation between the three upper bands of the leftmost lanes would otherwise be very difficult. Because the profile of the projection is usually very irregular, a smoothing operation is required. The extreme points of the resulting curve are a good indication of the presence of the lanes, as well as of their central positions. In fact, each lane is limited by two maximal points of the curve, while the points where the projection is minimal are associated with the lane central positions. Figure 1 presents an original ROI (on the left) with the detected lane central positions overlapped; these positions were determined from the intensity profile shown on the right side of the figure.
Fig. 1. Automatic lane detection based on the mean intensity profile of the ROI: original ROI with detected lane central positions (left) and intensity profile of the ROI projection onto the horizontal direction (right).
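A minimal sketch of this lane-detection step, assuming the ROI is available as a 2-D NumPy array of grey levels with dark bands on a lighter background; the smoothing window length and the minimum peak spacing are illustrative choices, not values taken from the paper.

```python
import numpy as np
from scipy.signal import find_peaks

def detect_lane_centres(roi, smooth_len=15):
    """Locate lane central positions from the horizontal projection profile.

    roi        : 2-D array of grey levels (dark bands on a light background).
    smooth_len : length of the moving-average window used to smooth the profile.
    """
    # Project intensities onto the horizontal direction (sum of each column).
    profile = roi.sum(axis=0).astype(float)
    # Smooth the irregular profile with a simple moving average.
    kernel = np.ones(smooth_len) / smooth_len
    smooth = np.convolve(profile, kernel, mode="same")
    # Dark bands lower the column sums, so lane centres show up as minima of
    # the smoothed profile, while the maxima between them mark the lane limits.
    centres, _ = find_peaks(-smooth, distance=smooth_len)
    limits, _ = find_peaks(smooth, distance=smooth_len)
    return centres, limits, smooth
```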
2.3 Band and Pattern Identification
After identification of the lanes present in the ROI, each lane is analysed in order to locate its bands and to extract band features such as position, area, width and maximum intensity. For each lane, only the area surrounding its central position is used for calculation purposes (10% on each side of the lane central position). In order to extract the characteristic values of the bands, the selected area is projected onto the vertical direction to obtain a lane intensity profile. After smoothing this profile, the minimal points of the curve whose intensities are below a threshold value are used to define the band central positions. The threshold is derived from the standard deviation of the differences between the real and the smoothed profiles. Band limits are associated with the maximal values that occur on each side of the central point. The next step is the location of the reference lane, to identify the band associated with the lactose, as it will be used as a reference for all bands in the other lanes. For each lane, the lane starting point, as well as the lane baseline, also need to be estimated. The lane baseline is the line that links the maximal points of the profile where there is a good separation between two consecutive bands. This baseline can be understood as a characterization of the background intensity at each profile point. The line is obtained by a morphological closing operation on the smoothed lane intensity profile. The length of the linear structuring element is equal to the maximal distance between the points of the lane profile that have the highest intensities. Figure 2 shows the profile of the lane shown on the left, together with the corresponding baseline (dotted line).
Fig. 2. Smoothed mean intensity profile and corresponding baseline (dotted line) for the lane shown on the left.
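A sketch of the band-detection step for a single lane, following the description above. The 10% strip width comes from the text; the structuring-element length is passed in explicitly, and the thresholding rule below is a simple stand-in for the rule described in the text rather than its exact formulation.

```python
import numpy as np
from scipy.ndimage import grey_closing
from scipy.signal import find_peaks

def detect_bands(roi, centre, lane_width, se_len, smooth_len=9):
    """Find band centres and the baseline for the lane centred at `centre`."""
    # Keep only a strip of 10% of the lane width on each side of the centre.
    half = max(1, int(0.1 * lane_width))
    strip = roi[:, centre - half: centre + half + 1].astype(float)
    # Lane intensity profile: projection onto the vertical direction.
    profile = strip.sum(axis=1)
    kernel = np.ones(smooth_len) / smooth_len
    smooth = np.convolve(profile, kernel, mode="same")
    # Threshold derived from the deviation between raw and smoothed profiles
    # (an illustrative choice, not the exact rule used by the authors).
    thr = smooth.mean() - np.std(profile - smooth)
    # Band centres: minima of the smoothed profile below the threshold.
    minima, _ = find_peaks(-smooth)
    band_centres = [m for m in minima if smooth[m] < thr]
    # Baseline: grey-level morphological closing with a linear element.
    baseline = grey_closing(smooth, size=se_len)
    return band_centres, baseline
```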
3 Band Feature Extraction At the end of the chromatographic process, the intensity profiles of the bands can take different shapes, depending on several factors such as the diffusion processes and statistical irregularities. When the band peaks broaden out almost symmetrically and can be fitted by a Gaussian curve, the chromatographic process is called linear and non-ideal chromatography [6]. In this paper, we consider that all chromatographic images have these properties. Based on this assumption, when band separation is not clear, we consider that the resulting intensity profile corresponds to a sum of two or more Gaussian functions (a mixture of Gaussians). This situation is illustrated in the lower segment of the profile of Fig. 2, where it is possible to observe three peaks that can be associated with the sum of three Gaussian functions. The number of Gaussian functions to be estimated depends on the number of band peaks. In a Gaussian mixture model, a probability density function is defined as a linear combination of $M$ Gaussian components, $p(x) = \sum_{i=1}^{M} P(i)\, p(x \mid i)$, where the $P(i)$ are the mixing coefficients (or prior probabilities) and $p(x \mid i)$ is the density function of component $i$. The density function of component $i$ is a Gaussian with mean $\mu_i$ and variance $\sigma_i^2$, defined by $p(x \mid i) = \frac{1}{\sqrt{2\pi}\,\sigma_i} \exp\!\left(-\frac{(x-\mu_i)^2}{2\sigma_i^2}\right)$.
To estimate the parameters of the mixture model, $P(i)$, $\mu_i$ and $\sigma_i$, two different methods were used: maximization of the data likelihood with the EM (expectation-maximization) algorithm, and a nonlinear least-squares method with a trust-region algorithm.
3.1 EM Algorithm The aim of maximum likelihood estimation is to find the parameter set that maximizes the likelihood of the given samples. The EM algorithm [5] is an iterative method that generates a sequence of estimates, starting from an initial parameter set. At each iteration, two main steps can be identified. In the E-step, the expected value of the log-likelihood of the complete data is computed, conditioned on the observed samples and the current solution. In the M-step, the estimate for the current iteration is chosen as the one that maximizes this expectation. For the initialization of the model, the K-means algorithm was used. The EM algorithm is terminated as soon as the value of the error function does not vary substantially during two consecutive cycles.
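A compact sketch of EM for a one-dimensional Gaussian mixture. Here the lane profile is treated as a set of weighted samples (one per row position, weighted by the profile height above the baseline), which is one possible reading of the procedure rather than the authors' exact formulation; K-means initialization is replaced by evenly spaced initial means for brevity.

```python
import numpy as np

def em_gaussian_mixture(x, w, m, n_iter=200, tol=1e-6):
    """Fit an m-component 1-D Gaussian mixture to weighted samples by EM.

    x : sample positions (e.g. row indices of the lane profile)
    w : non-negative sample weights (e.g. profile height above the baseline)
    m : number of Gaussian components
    """
    w = w / w.sum()
    # Simple initialization: evenly spaced means, common variance, equal priors.
    mu = np.linspace(x.min(), x.max(), m + 2)[1:-1]
    var = np.full(m, np.var(x) / m)
    pi = np.full(m, 1.0 / m)
    prev = -np.inf
    for _ in range(n_iter):
        # E-step: responsibilities of each component for each sample.
        d = x[:, None] - mu[None, :]
        comp = pi * np.exp(-0.5 * d**2 / var) / np.sqrt(2 * np.pi * var)
        total = comp.sum(axis=1, keepdims=True) + 1e-300
        resp = comp / total
        # M-step: weighted updates of the priors, means and variances.
        nk = (w[:, None] * resp).sum(axis=0)
        pi = nk / nk.sum()
        mu = (w[:, None] * resp * x[:, None]).sum(axis=0) / nk
        var = (w[:, None] * resp * (x[:, None] - mu)**2).sum(axis=0) / nk
        var = np.maximum(var, 1e-6)
        # Stop when the weighted log-likelihood no longer changes much.
        ll = (w * np.log(total.ravel())).sum()
        if abs(ll - prev) < tol:
            break
        prev = ll
    return pi, mu, var
```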
3.2 Nonlinear Least Squares Method and Trust-Region Algorithm In this approach a least-squares fitting method was used for estimating the parameters of the Gaussian mixture [7]. The algorithm is an iterative procedure, defined by the following four steps: 1. Start with an initial estimate for each parameter; 2. Produce the fitted curve for the current set of parameters; 3. Adjust the function parameters using the trust-region algorithm; 4. Repeat the last two steps until the specified convergence criterion is reached. The main idea behind the trust-region algorithm is that the function to optimize, f, can be approximated by a simpler function, q, in a neighbourhood N around the point where the fit needs to be improved. This approximation is defined by the first two terms of the Taylor series of the function [8].
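One possible realization of this fit is SciPy's Trust Region Reflective solver; the residual below fits a sum of M Gaussians to the (baseline-subtracted) lane profile. The parameter bounds and initial guesses are illustrative assumptions, not values from the paper.

```python
import numpy as np
from scipy.optimize import least_squares

def gaussian_sum(params, x, m):
    """Sum of m Gaussians; params = [a1, mu1, s1, a2, mu2, s2, ...]."""
    p = np.asarray(params).reshape(m, 3)
    return sum(a * np.exp(-0.5 * ((x - mu) / s) ** 2) for a, mu, s in p)

def fit_bands_trust_region(x, profile, init_params, m):
    """Fit the band profile with a Gaussian sum using a trust-region solver."""
    residual = lambda p: gaussian_sum(p, x, m) - profile
    # 'trf' is SciPy's Trust Region Reflective algorithm; the bounds keep the
    # amplitudes and widths positive and the means inside the lane.
    lower = [0.0, x.min(), 1e-3] * m
    upper = [np.inf, x.max(), np.inf] * m
    result = least_squares(residual, init_params, method="trf",
                           bounds=(lower, upper))
    return result.x.reshape(m, 3)   # rows of (amplitude, mean, sigma)
```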
4 Experimental Results For evaluating the automatic lane detection method, we used a sample of 30 images. Usually, three carbohydrates, namely nanose, lactose and raffinose, are used as references. However, because in this study only the lactose lane is of real interest for band classification, the two other possible references, when present in the plate, are discarded. The results of the lane detection method for the image test set are expressed in terms of the confusion matrix presented in Table 1.
Figure 3 shows the results of the two methods used for Gaussian function separation when applied to the lane shown in Fig. 2. In these images the threshold value used for band detection is 10. To determine the quality of the fit, two goodness-of-fit statistics, R-square (coefficient of determination) and RMSE (root mean squared error), were calculated. The values obtained are 91% and 5.0 for the EM algorithm, and 94% and 4.0 for the trust-region algorithm. From the analysis of each lane, the number of bands (number of detected Gaussian functions) and some band features derived from the Gaussian function parameters were calculated. Table 2 summarizes these values for the lane in Fig. 2, using the EM method for curve detection. The central position of each band is expressed in units relative to the reference band. The area and the width at the base of the band are expressed in pixels.
Fig. 3. Curve fit and curve intensity profile for the lane shown in fig. 2. Results for the EM algorithm (left) and nonlinear least squares –Trust-region algorithm (right).
To compare the performance of the EM and trust-region algorithms for band detection, 7 images with 54 lanes and 202 visible bands were analyzed. The results are summarized in Table 3.
Both algorithms produce good results in the extraction of features from bands. The trust-region algorithm is faster, and it allows the definition of the search interval; this latter feature leads to results that are better than those obtained with the EM algorithm. In the EM method, all available information in the neighbourhood of each band is included, and when the noise is significant, the number of incorrectly detected bands usually increases.
5 Conclusion and Future Work We have proposed a system that allows the automatic detection of lanes and bands in TLC images. The identification of the bands contained in each lane is based on the fitting of a set of Gaussian functions to the corresponding intensity profile. For this purpose, two methods, the EM algorithm and the trust-region algorithm, were used. Although both methods produced useful results, it was found that their results are very sensitive to the choice of the number of Gaussians to include in the mixture. In the future we intend to improve these methods and to define a classification system to achieve the goals proposed in the introduction of this article: to develop an automatic method for the analysis of TLC images that extracts a meaningful set of features from each lane. This will allow the automatic recognition and classification of patterns as normal or pathological cases; within the normal class, we will then try to identify new classes.
References
1. Sewell, A.: Urinary Oligosaccharides. In: Techniques in Diagnostic Human Biochemical Genetics. Wiley-Liss, Inc. (1991) 219-231
2. Thomas, G.: Disorders of Glycoprotein Degradation: Fucosidosis and Sialidosis (Chap. 140)
3. Starr, C., Masada, R., Hague, C., Klock, J.: Fluorophore-assisted Carbohydrate Electrophoresis in the Separation, Analysis, and Sequencing of Carbohydrates. Journal of Chromatography A 720 (1996) 295-321
4. Klein, A., Lebreton, A., Lemoine, J., Périni, J., Roussel, P., Michalski, J.: Identification of Urinary Oligosaccharides by Matrix-assisted Laser Desorption Ionization Time-of-flight Mass Spectrometry. Clinical Chemistry 44:12 (1998) 2422-2428
5. Dempster, A., Laird, N., Rubin, D.: Maximum Likelihood from Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society B 39(1) (1977) 1-38
6. Schwedt, G.: The Essential Guide to Analytical Chemistry. John Wiley & Sons
7. Dennis, J.: Nonlinear Least-squares. In: State of the Art in Numerical Analysis, D. Jacobs (ed.), Academic Press
8. Byrd, R., Schnabel, R., Shultz, G.: Approximate Solution of the Trust Region Problem by Minimization over Two-Dimensional Subspaces. Mathematical Programming 40 (1988) 247-263
9. Branch, M., Coleman, T., Li, Y.: A Subspace, Interior, and Conjugate Gradient Method for Large-Scale Bound-Constrained Minimization Problems. SIAM Journal on Scientific Computing 21(1) (1999) 1-23
10. Nabney, I.: Netlab: Algorithms for Pattern Recognition. Advances in Pattern Recognition. Springer (2003)
11. Marques, J.: Reconhecimento de Padrões. Métodos Estatísticos e Neuronais. IST-Press
12. Machado, A., Campos, M., Siqueira, A., Carvalho, O.: An Iterative Algorithm for Segmenting Lanes in Gel Electrophoresis Images (1997)
13. www.nonlinear.com/products/specifications.asp
14. www.applied-maths.com/bionumerics/features_lay.htm
15. www.herolab.de/e_gelso.html
16. www.fotodyne.com
17. www.mediacy.com
Automatic Tracking of Arabidopsis thaliana Root Meristem in Confocal Microscopy

Bernardo Garcia (1), Ana Campilho (2), Ben Scheres (2), and Aurélio Campilho (1,3)

(1) INEB - Instituto de Engenharia Biomédica, Laboratório de Sinal e Imagem Biomédica
(2) Department of Molecular Genetics, University of Utrecht, The Netherlands
[email protected] {a.campilho,b.scheres}@bio.uu.nl
(3) Universidade do Porto, Faculdade de Engenharia, Dept. Eng. Electrotécnica e de Computadores, R. Dr. Roberto Frias, s/n, 4200-465 Porto, Portugal
[email protected]
Abstract. Arabidopsis thaliana is a well-defined and well-suited system for studying plant development at the cellular level. To follow the root meristem activity in vivo under a confocal microscope, the image acquisition process was automated through a coherent observation of a fixed point of the root tip. This position information allows the microscope stage control to track the root tip. Root tip estimation is performed following two approaches: computing the intersection of the root central line with the contour, or the point of maximum curvature of the filtered contour. The first method fits the root borders with lines, using the Radon transform and a classification procedure. The central line is defined as the line that bisects the angle between these lines. The intersection of the central line with the root contour provides an estimate of the root tip position. The second method is based on contour traversing, followed by convolution of the contour coordinates with a Gaussian kernel. Curvature is computed for this filtered contour. The maximum curvature point provides another root tip estimate. A third method, based on a Kalman estimator, is used to select between the previous two outputs. The system allowed the tracking of the root meristem for more than 20 hours in several experiments.
1 Introduction
Most of what we see of plants is derived from postembryonic development and depends largely on the activity of small regions of dividing cells, the meristems. The most important of these are the shoot apical meristem and the root apical meristem. The Arabidopsis root contains a simple and almost invariant cellular pattern, with single layers of different cell types surrounding a small central vascular bundle. The root meristem, located at the tip of the root, perpetuates this pattern by cellular division [1]. However, the control of these divisions is not yet understood. To enable the exploration of the regulation of cell division in the Arabidopsis root meristem, the establishment of an in vivo system under a confocal
microscope to follow and visualize the root meristem was carried out. It is not possible to retrieve images of the meristem from a fixed stage position for an indefinite amount of time, as the root is growing. This paper presents a process to automate the acquisition of root images, modeling the position observability problem as the extraction of root tip position information. Root tip estimation is done through two methods: the intersection of the root central line with the contour, and the maximum curvature of the filtered contour. The first method fits the root borders with lines, using the Radon transform and a separation procedure. The central line is defined as the line that bisects the angle between these lines. The intersection of the central line with the root contour provides an estimate of the root tip position. The second method is based on contour traversing, followed by convolution of the contour coordinates with a Gaussian kernel. Curvature is computed for this filtered contour. The maximum curvature point provides another root tip estimate. A third method, based on a Kalman estimator, is used to select between the previous two outputs (the nearest neighbour is selected). This paper is organized as follows: Section 2 explains the process of root tip estimation. Section 3 presents and discusses some results. Section 4 derives the main conclusions.
2 Root Tip Estimation
This section introduces the root tip estimation method as well as its constituent phases: Sect. 2.1 defines the process of root detection in the main image; Sect. 2.2 clarifies the process used for contour detection; Sect. 2.3 explains the method based on the intersection of the central line with the contour; Sect. 2.4 describes the estimate based on the maximum curvature point; and Sect. 2.5 introduces the method for arriving at a final estimate.
2.1 Root Detection
Both our estimates are based on a reasonably robust contour detection. We use an object detection phase followed by contour extraction. The object detection phase can be summarized as follows: the result of an anisotropic diffusion filter is thresholded with the Otsu method, and a series of morphological operations is then performed in order to obtain a connected object. To smooth the image, reducing noise while having only a moderate dispersion effect on the object borders, we use an anisotropic diffusion filter, which shows good results [2,3]. The anisotropic diffusion filtered image at iteration $t+1$ is given by (1):

$I_s^{t+1} = I_s^{t} + \frac{\lambda}{|\eta_s|} \sum_{p \in \eta_s} g(\nabla I_{s,p})\, \nabla I_{s,p}$    (1)

where $I^t$ is a discretely sampled image, $s$ denotes the pixel position in a discrete, two-dimensional (2-D) grid, and $t$ denotes discrete time steps (iterations).
The constant $\lambda$ is a scalar that determines the rate of diffusion, $\eta_s$ represents the spatial neighborhood of pixel $s$, and $|\eta_s|$ is the number of neighbors (usually four, except at the image boundaries). Perona and Malik [4] linearly approximated the image gradient (magnitude) in a particular direction as shown in (2):

$\nabla I_{s,p} = I_p^{t} - I_s^{t}, \quad p \in \eta_s$    (2)

For a thorough analysis of the theory underlying this method, refer to [5]. The filter parameters used in this work correspond to 5 iterations, together with fixed values of the diffusion rate and of the edge-stopping scale parameter.
The edge-stopping function $g(\cdot)$ used in this paper is given in (3); it depends on a scale parameter selected by the user. The resulting image is thresholded via the Otsu method [6], yielding a binary image. Other thresholding methods were tested, with better results in sporadic cases in terms of contour extraction, but this threshold proved to be more reliable and robust. For an extensive review of thresholding methods, refer to [7]. Cells that may have separated from the root may be picked up by the threshold and thus become problematic for further analysis, namely threatening a reliable curvature computation. Moreover, other objects, including other roots, may appear in the image. To avoid ambiguity, we assume that the root is the largest object in the image. A region labeling operation is performed, yielding several regions. An operation corresponding to largest-object selection is then performed to extract the root, using 4-connectivity: according to (4), all regions except the one with the largest area are discarded, yielding the root region.
This image is not yet suitable for contour detection, as several holes normally appear inside the object borders, corresponding to the root cells. To remove these holes, we perform a morphological closing with a square structuring element, with side equal to 2% of the horizontal length of the image, followed by a hole-removal (region filling) operation. The resulting image corresponds to a hole-free object defining the detected location of the root.
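A sketch of the root-detection pipeline of Sect. 2.1 with a basic 4-neighbour Perona-Malik update. The edge-stopping function and its scale parameter are the classical Perona-Malik choices, used here only as stand-ins for the (unspecified) function of Eq. (3); the bright-root assumption and the default parameter values are likewise illustrative.

```python
import numpy as np
from scipy import ndimage
from skimage.filters import threshold_otsu

def anisotropic_diffusion(img, n_iter=5, lam=0.25, k=15.0):
    """4-neighbour Perona-Malik diffusion; g is the classical 1/(1+(x/K)^2)."""
    out = img.astype(float)
    g = lambda d: 1.0 / (1.0 + (d / k) ** 2)       # stand-in edge-stopping function
    for _ in range(n_iter):
        # Differences towards the four neighbours (borders wrap; fine for a sketch).
        dn = np.roll(out, 1, axis=0) - out
        ds = np.roll(out, -1, axis=0) - out
        de = np.roll(out, -1, axis=1) - out
        dw = np.roll(out, 1, axis=1) - out
        out = out + lam * (g(dn) * dn + g(ds) * ds + g(de) * de + g(dw) * dw) / 4.0
    return out

def detect_root(img):
    """Diffusion -> Otsu threshold -> largest 4-connected object -> close/fill."""
    smooth = anisotropic_diffusion(img)
    binary = smooth > threshold_otsu(smooth)        # assumes a bright root on dark background
    labels, n = ndimage.label(binary, structure=[[0, 1, 0], [1, 1, 1], [0, 1, 0]])
    if n == 0:
        return np.zeros_like(binary)
    sizes = ndimage.sum(binary, labels, index=range(1, n + 1))
    root = labels == (np.argmax(sizes) + 1)          # keep the largest region only
    side = max(3, int(0.02 * img.shape[1]))          # 2% of the horizontal length
    root = ndimage.binary_closing(root, structure=np.ones((side, side)))
    return ndimage.binary_fill_holes(root)
```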
2.2 Contour Detection
The contour estimation is based on binary morphological operations. The contour detection procedure is based on subtracting from the root image its erosion, as shown in (5); the resulting contour image is subsequently thinned, in order to allow contour traversing:

$C = I_{\mathrm{root}} - E(I_{\mathrm{root}})$    (5)
Fig. 1. Intermediate results in contour detection: a) original image; b) image after anisotropic diffusion filtering; c) thresholding phase; d) detected contour.
where the $E$ operator represents the morphological erosion operation. Figure 1 shows an example of the main intermediate results for a test image, up to the extraction of the final contour.
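A direct rendering of Eq. (5): the contour is the set difference between the binary root mask and its erosion, then thinned to one pixel so that it can be traversed. The 3x3 structuring element is an assumption.

```python
import numpy as np
from scipy.ndimage import binary_erosion
from skimage.morphology import thin

def extract_contour(root_mask):
    """Contour = mask minus its erosion (Eq. 5), thinned for traversal."""
    eroded = binary_erosion(root_mask, structure=np.ones((3, 3)))
    contour = root_mask & ~eroded
    return thin(contour)
```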
2.3 Central Line Intersection with Contour
In this root tip position estimation approach, we assume that the root tip is the intersection of the root contour with the root’s central line. We assume therefore that the root borders can be approximated by lines. The central line is defined as the line that bisects the minimal angle between these border lines. Root Border Detection. To detect the lines approximating the borders, the Radon transform is used. First, we compute the Radon transform [8] for orientations between 0 and $\pi$. We then extract two line parameter estimates from this data. This is done through thresholding and clustering. A histogram-based threshold is computed. First, we divide the Radon accumulator range into 128 equally spaced bins. A histogram of the Radon transform is built, fitting the points into these bins. Only the upper range of values is of interest to estimate the line parameters. Therefore, an accumulated histogram is computed and the threshold is extracted as the value that separates at least 99.9% of the transform area from the 0.1% corresponding to the upper values. This value is successively multiplied by 90% until at least 2 points exist in the thresholded set. Note that although we set this area as a fixed parameter, it is iteratively adjusted. The Radon points above this threshold are selected, and the transform is set to 0 elsewhere. From these points, a data set is constructed to serve as input to the clustering process: the threshold is subtracted from each non-zero point, yielding a new matrix, and for each point of this matrix a number of samples (equal to its rounded value) with the point coordinates as features is added to the data set. These points are now clustered hierarchically to yield 2 clusters, using the scaling-invariant Mahalanobis
distance [9,10] and the Ward method [11,10]. With this scheme, the threshold can still possibly select only one group. Nevertheless, the clustering process will still output two different clusters. To prevent such problems, which would result in large deviations from reasonable root tip estimates, a minimum distance between the centroids is required. The value for this minimum distance was chosen experimentally and proved to be robust. If the centroids are not further apart than the minimum distance, the threshold is lowered by 10% and the process is repeated. After 6 iterations without fulfilling the criterion, the process continues; we assume in that case that the lines have similar slopes and are closer together, i.e., the root is narrower. Cases of nearly vertical lines are treated specially, since the theta dimension of the Radon transform should be wrapped around itself: switching from 0 to $\pi$ only means a sign reversal in terms of the centre distance. Central Line Parameters Extraction. The central line is the line bisecting the minimal angle between the border lines, whose parameters are $(\rho_1, \theta_1)$ and $(\rho_2, \theta_2)$. The central line’s angle with either of the border lines is the average of these lines’ angles. Its minimal distance to the origin is calculated by forcing it to pass through the intersection of the border lines. These parameters can be extracted using equations (6, 7):

$\theta_c = \frac{\theta_1 + \theta_2}{2}$    (6)

$\rho_c = x_0 \cos\theta_c + y_0 \sin\theta_c$    (7)

where $(x_0, y_0)$ is the intersection point of the two border lines.
If these lines are approximately parallel, the intersection point is near infinity. We assume that the lines are parallel if their angles’ difference is below a small threshold. In this case, the central line parameters are calculated by averaging the corresponding parameters of the border lines.
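A simplified sketch of the border-line search: the Radon transform of the contour image is thresholded and the surviving (distance, angle) cells are grouped into two clusters with Ward linkage. The Mahalanobis-based scheme, the per-point sample weighting, the iterative threshold adjustment and the special handling of near-vertical lines described above are omitted; the threshold fraction is an illustrative value.

```python
import numpy as np
from skimage.transform import radon
from scipy.cluster.hierarchy import linkage, fcluster

def border_lines(contour_img, n_angles=180, keep_frac=0.001):
    """Estimate the two border-line (rho, theta) pairs and the central-line angle."""
    theta = np.linspace(0.0, 180.0, n_angles, endpoint=False)
    sinogram = radon(contour_img.astype(float), theta=theta, circle=False)
    # Keep only the strongest accumulator cells (upper fraction of the area).
    thr = np.quantile(sinogram, 1.0 - keep_frac)
    rows, cols = np.nonzero(sinogram > thr)
    rho = rows - sinogram.shape[0] / 2.0          # approximate signed distance to centre
    ang = np.deg2rad(theta[cols])
    pts = np.column_stack([rho, ang])
    # Two clusters with Ward linkage on standardized features (simplified scheme).
    z = (pts - pts.mean(axis=0)) / (pts.std(axis=0) + 1e-9)
    labels = fcluster(linkage(z, method="ward"), t=2, criterion="maxclust")
    lines = [pts[labels == k].mean(axis=0) for k in (1, 2)]
    # Central line: bisects the angle between the two border lines (Eq. 6);
    # its distance parameter follows from the intersection point (Eq. 7).
    (r1, t1), (r2, t2) = lines
    theta_c = (t1 + t2) / 2.0
    return (r1, t1), (r2, t2), theta_c
```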
2.4 Maximum Curvature Estimation
This method assumes that the root tip position corresponds to the contour point with maximum curvature. The initial and final points for contour traversing are chosen as the pair of contour points 1 pixel away from the image border whose mutual distance is maximal. The coordinates are padded with their initial and final values at the initial and final positions, to yield valid values after convolution. This padding approach is suggested in [12]. The $x$ and $y$ padded coordinates are convolved with a Gaussian kernel of unitary width, computed in the interval [-3, 3] with a 0.1 step size. The window’s area is unitary, to avoid scaling of the coordinates.
As curvature is a function of the first- and second-order derivatives of the coordinates, the high-frequency components of the noise could be amplified by this differentiation and yield false maxima. Convolution of the coordinates with the Gaussian kernel results in a filtered contour, with fewer high-frequency components in each coordinate. The curvature of a contour with coordinates $x(u)$ and $y(u)$ is given by (8):

$\kappa(u) = \frac{x'(u)\, y''(u) - y'(u)\, x''(u)}{\left(x'(u)^2 + y'(u)^2\right)^{3/2}}$    (8)

where the $'$ operator represents differentiation with respect to the contour parameter $u$. Curvature is not defined if both coordinate derivatives are null. Thus, we have a discontinuity near the start and end points of the contour, and numerical computation of the curvature yields high values near these points. To avoid this problem, a validity region was established: the points excluded are those that are less than 1/3 of the size of the convolution window away from either the start or the end point. After curvature computation in the valid region, the maximum point is selected. The root tip estimate r corresponds to the coordinate point of the original contour having the same contour parameter as the curvature maximum.
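A sketch of the curvature estimate of Eq. (8) on Gaussian-filtered contour coordinates; the kernel sampling follows the description above (interval [-3, 3], step 0.1, unit area) and the contour is assumed to contain many more points than the kernel.

```python
import numpy as np

def max_curvature_point(x, y):
    """Return the index of the maximum-curvature point of a traversed contour."""
    # Unit-area Gaussian kernel of unitary width, sampled on [-3, 3] with step 0.1.
    t = np.arange(-3.0, 3.0 + 1e-9, 0.1)
    g = np.exp(-0.5 * t**2)
    g /= g.sum()
    # Pad the coordinates with their end values so the convolution stays valid.
    pad = len(g) // 2
    xp = np.concatenate([np.full(pad, x[0]), x, np.full(pad, x[-1])])
    yp = np.concatenate([np.full(pad, y[0]), y, np.full(pad, y[-1])])
    xs = np.convolve(xp, g, mode="same")[pad:-pad]
    ys = np.convolve(yp, g, mode="same")[pad:-pad]
    # Curvature k = (x'y'' - y'x'') / (x'^2 + y'^2)^(3/2)  (Eq. 8).
    dx, dy = np.gradient(xs), np.gradient(ys)
    ddx, ddy = np.gradient(dx), np.gradient(dy)
    k = (dx * ddy - dy * ddx) / np.power(dx**2 + dy**2 + 1e-12, 1.5)
    # Exclude points closer than one third of the window to either endpoint.
    margin = len(g) // 3
    valid = np.abs(k[margin:-margin])
    return int(np.argmax(valid) + margin)
```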
2.5 Final Estimate Calculation
Two methods for root tip estimation have been presented; the method for combining them remains to be described. In our approach, we use a Kalman estimator to select between the estimates resulting from the previous methods. Our choice is supported by the assumption that the root movement characteristics do not vary significantly between frames. Accordingly, we calculate an approximate root position based on a linear combination of previous positions. The selected estimate is the one with the least Euclidean distance to this predicted position. Our goal with this approach is to lower the probability of losing track of the tip, which may easily result in a premature cancellation of the experiment. The procedure is to adapt a Kalman filter, based on the previous root positions, to estimate 3 coefficients that serve as weights for linear prediction. Filtering the last 3 positions with these coefficients yields the predicted position.
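A simplified version of the selection step: the next tip position is predicted as a weighted combination of the last three positions and the candidate closest to the prediction is kept. The paper adapts a Kalman filter to learn the three weights; the fixed constant-velocity weights below are an assumption used only for illustration.

```python
import numpy as np

def select_estimate(history, candidates, weights=(1.5, 0.0, -0.5)):
    """Pick the candidate tip closest to a linear prediction of the next position.

    history    : list of the last three (x, y) tip positions, newest first
    candidates : iterable of (x, y) estimates (central-line and curvature based)
    weights    : linear-prediction coefficients (here a constant-velocity guess,
                 not the Kalman-estimated weights of the paper)
    """
    h = np.asarray(history[:3], dtype=float)
    predicted = np.tensordot(np.asarray(weights), h, axes=(0, 0))
    cands = np.asarray(list(candidates), dtype=float)
    dists = np.linalg.norm(cands - predicted, axis=1)
    return tuple(cands[np.argmin(dists)]), tuple(predicted)
```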
3 Results
After estimation of the root tip, the microscope stage is moved so that the centre of the microscope field of view coincides with the current root tip estimate. The system then waits a certain amount of time before retrieving another frame from this root and repeating the process. This delay between acquisitions is used to acquire images from other roots, previously selected at the start of the experiment.
Fig. 2. Sample results. The dotted lines represent the border lines. The larger circle’s center point corresponds to the curvature estimate, whereas the smaller circle’s point represents the estimate derived through the central line intersection with the contour: a) typical result in normal lighting conditions; b) deficient lighting causes the central line estimation to yield a point inside the root; c) thresholding phase yields a biased curvature maximum; d) deficient lighting causes the central line estimation to yield a point inside the root.
The root tip estimation results are good, yielding a relatively robust tracking application. There are nevertheless threshold sensitivity problems that can produce a bias in the root tip estimation. However, since this bias is fairly consistent from frame to frame, the tracking quality does not deteriorate significantly, and the meristem seldom leaves the microscope field of view. With the approach
described herein, the system was able to track the root meristem for more than 20 hours in several experiments. In Figure 2 we present annotated results, where the locations of the estimated root tips can be observed; these serve as a basis for the computation of the final estimate (the one closest to the Kalman prediction).
4 Conclusions
We presented a process to perform in vivo confocal microscopy of Arabidopsis thaliana root meristems, through a coherent observation of the root tip. As manual acquisitions can take a significant amount of time, a successful implementation of our method, integrated with a microscope stage control system, proved to be valuable. The success of the system depends on the robustness of the observation process. We estimate the root tip position by taking the intersection of the root central line with the contour and by computing the maximum curvature point. Both of the methods presented for root tip estimation yielded good results. Notwithstanding, we also use a Kalman-based selection scheme to prevent abrupt changes. In our final testing phase, experiments that terminated prematurely tended to do so more often for a biological reason (e.g. loss of focus or photo-bleaching) than because of a wrong estimation, leading to the conclusion that the method is valid for the task. It was possible to track the root meristem for more than 20 hours in several experiments. We are not aware of references to a similar tracking application in this domain and at this visual scale. As future work, we will try to address the loss-of-focus problem and take into account the 3-D information.
References
1. Dolan, L., Janmat, K., Willemsen, V., Linstead, P., Poethig, S., Roberts, R., Scheres, B.: Cellular organisation of the Arabidopsis thaliana root. Development 119 (1993) 71-84
2. Gerig, G., Kubler, O., Kikinis, R., Jolesz, F.: Nonlinear anisotropic filtering of MRI data (1992)
3. Black, M.J., Sapiro, G., Marimont, D., Heeger, D.: Robust anisotropic diffusion. IEEE Trans. on Image Processing 7 (1998) 421-432
4. Perona, P., Malik, J.: Scale-space and edge detection using anisotropic diffusion. IEEE Trans. Pattern Anal. Mach. Intell. 12 (1990) 629-639
5. Weickert, J.: Theoretical foundations of anisotropic diffusion in image processing. In: Theoretical Foundations of Computer Vision (1994) 221-236
6. Otsu, N.: A threshold selection method from gray level histograms. IEEE Trans. Systems, Man and Cybernetics 9 (1979) 62-66
7. Trier, O.D., Taxt, T.: Evaluation of binarization methods for document images. IEEE Transactions on Pattern Analysis and Machine Intelligence 17 (1995) 312-315
8. Toft, P.: The Radon Transform - Theory and Implementation. PhD thesis, Department of Mathematical Modelling, Technical University of Denmark (1996)
9. Campbell, N.A.: Robust procedures in multivariate analysis. I. Robust covariance estimation. Applied Statistics (1980) 231-237
10. Marques de Sá, J.P.: Pattern Recognition. Concepts, Methods and Applications. Springer-Verlag (2001)
11. Ward, J.: Hierarchical grouping to optimize an objective function. Journal of the American Statistical Association (1963) 236-244
12. Mackworth, A., Mokhtarian, F.: Scale space description and recognition of planar curves and two-dimensional shapes. IEEE Transactions on Pattern Analysis and Machine Intelligence (1986) 34-43
A New File Format for Decorative Tiles

Rafael Dueire Lins

Universidade Federal de Pernambuco, Recife - PE, Brazil
[email protected]
Abstract. This work analyses the features of images of ceramic tiles with the aim of obtaining a compact representation for them in web pages. This allows for faster network transmission and less space for storage of tile images. The methodology developed may apply to any sort of repetitive symmetrical colour images and drawings. Keywords: Image compression, web pages, ceramic tiles.
1 Introduction “The decorative tile - azulejo in Portuguese (the word comes from the Arabic az Zulayj, “burnished stone”) – was considered a fine element of decoration by all ancient civilizations. ...Persia was the centre for development of almost all the tile-producing techniques used in Europe and was probably also the birthplace of the azulejo. The Arabs took it from their lands in the East to Italy and Spain.” [1]. From Europe tiles spread worldwide, and they remain one of the most important finishing and decorative elements in architectural design today, above all in warm-weather countries. Searching the Internet, one finds several hundred sites related to ceramic tiles all over the world, ranging from virtual museums to manufacturer catalogues. Tiles are seldom used in isolation; in general, they form panels applied to floors and walls. From their very beginning until today, the motifs are geometric figures with well-defined contours, painted in solid colours. Very rarely does a ceramic tile exhibit more than four or five colours. This paper proposes a taxonomy for tiles, based on the symmetry of their geometrical patterns. This minimises the drawing information and stores it in a very efficient way, from which the whole pattern can later be generated. The colour information may also be stored using a representation capable of defining the different regions of the tile with a reduced palette, whose colours are replaced by the original ones at presentation time. Another possibility is to map the pattern into different monochromatic images to be later colourised and merged. The pattern images are then compressed and stored using formats such as JPEG [4-7] or TIFF [8]. Whenever the visualisation of the files is needed, applets are loaded to assemble the original tile image from its components. These applets are currently being developed to work with plain compression algorithms only; however, there are plans to make them work with progressive file formats. In general, progressive algorithms split the original image into different “resolution layers”. Additional control information is
used to reassemble the layers forming the original image. The increase in size observed in the progressive versions of the GIF [4,5] and PNG [4] file formats was less than 5% of the size without such facility, largely justifying their use. Surprising was the behaviour of JPEG, whose progressive compression schemes reached sizes 10% smaller than plain JPEG compression [3]. The only drawback of such algorithms is the larger computational effort involved in processing the image for its decomposition, which may involve several scans of the original image. As processing time is by far smaller than network transmission time, the use of progressive algorithms is strongly recommended for images visualised through a network. This paper shows the gains in space obtained with several historic tiles of different patterns.
2 Classifying Tiles This paper proposes to classify the images of tiles according to the symmetry of their drawings. To illustrate the proposed taxonomy, some pictures of 19th-century tiles extracted from [1] are presented. A 1s-tile forms a non-symmetric drawing. These are used in very large panels and are seldom found; Tile 1 shows such a tile. A 2s-tile presents symmetry about the diagonal (45 degrees). Tile 2 below is an instance of such a tile, which is symmetrical in relation to both diagonals.
Fig. 1. Tile 1: 1r-tile, Portuguese tile, 13.5 x 13.5 cm
Fig. 2. Tile 2: 2s-tile, Portuguese tile, 13.5 x 13.5 cm
A 3s-tile is symmetric in relation to 30- and 60-degree cuts from its edge, and so on. These tile patterns are not often found. A different and frequently met pattern is obtained by rotating one figure around the centre of the tile. A 4r-tile is formed by rotating the same pattern four times; Tiles 3 and 4 are beautiful examples of this kind of design. A 6r-tile, common in hexagonal tiles, rotates a figure six times, and so on.
Fig. 3. Tile 3: 1r-tile, Portuguese tile, 14 x 14 cm
Fig. 4. Tile 4: 4r-tile, Portuguese tile, 13.5 x 13.5 cm
Sub-classifications are also possible. It is very common to find tiles whose pattern is obtained by rotating four times a figure which is symmetrical in relation to its diagonal. This is the case of Tile 5, presented below.
Fig. 5. Tile 5: 1r-tile, Portuguese tile, 14 x 14 cm
Fig. 6. Tile 6: 4r-tile, Portuguese tile, 13.5 x 13.5 cm
As the images of the tiles presented above are of real 19th-century tiles, which have been exposed to the weather and sometimes to poor conservation conditions, they may show damage which is irrelevant to the purposes of this paper.
3 Working Symmetry One of the two key ideas of this paper is to take advantage of the geometry of the patterns to perform image compression. Completely asymmetrical patterns, such as the one exhibited by Tile 1 above, are very rare in decorative tiles, being more common in large wall tile panels with religious motifs. In those cases, little can be done in a systematic way to compress images based on geometry, as attempted herein. This section analyses the patterns of the presented tiles which show some sort of geometrical symmetry.
3.1 Tiles Diagonally Symmetrical Observing the two diagonals of Tile 2 above, one immediately notices the symmetry of its pattern in relation to the 45-degree diagonal. This allows storing only half of the tile picture and generating the complete pattern whenever needed. The original image was obtained by scanning a colour photograph from [1] with a flatbed scanner manufactured by Hewlett Packard, model ScanJet 5300, in true colour at 300 dpi. The total number of colours in the image is around 60,000. Compressing and storing the Tile 2 image in JPEG [4-7], which showed one of the best compression rates for this kind of image [3], one obtains a file of size 78 KB. Removing the lower diagonal half of the image, one obtains:
Fig. 7. Upper half of tile 2
Fig. 8. Rectangle of upper half of tile 2
Although all relevant information, such as texture, predominant colours, etc., is still in the halved image, the number of colours dropped to 36,000 and the size of the corresponding JPEG file is only 44 KB, less than 57% of the size of the original JPEG file. One should remark that the halved image above still has the same number of pixels as the original image, but half of them have been painted white. Another artifice is possible: removing the white pixels introduced by the image halving process, yielding a rectangular image such as the one presented in Figure 8. This new image carries exactly the same information as the half image of Tile 2 presented in Figure 7 above, but its shape is more “regular” if “seen” from the perspective of the JPEG compression algorithm; thus this image yields a JPEG file of size 40 KB, slightly over 50% of the original JPEG file.
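A small illustration of the idea in this subsection, assuming a square greyscale tile: for a tile symmetric about its main diagonal, only the upper triangle needs to be kept, and the full pattern can be rebuilt by mirroring. Real, hand-painted tiles are only approximately symmetric, so the reconstruction is exact only for the stored half.

```python
import numpy as np

def keep_upper_triangle(tile):
    """Keep the half of a square greyscale tile above its main diagonal (rest = white)."""
    n = tile.shape[0]
    rows, cols = np.indices((n, n))
    return np.where(rows <= cols, tile, 255)      # paint the lower half white

def rebuild_from_half(half):
    """Rebuild the full tile by mirroring the stored half across the diagonal."""
    n = half.shape[0]
    rows, cols = np.indices((n, n))
    return np.where(rows <= cols, half, half.T)
```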
3.2 Tiles 2-Diagonally Symmetrical A tile whose pattern is symmetrical with respect to both diagonals, such as the tiles shown in Figures 3 and 4 above, is very often found in buildings of all ages due to its beauty and its suitability for the walls and floors of private houses. The image of Tile 4 was obtained as described above and has 263K colours and a size of 650 KB in JPEG format. Isolating one-fourth of the pattern of the tile above between diagonals, one obtains a wedge as
shown below, which corresponds to a JPEG file of size 200KB, just over 30% of the original size. Using the same artifice as before, one can transform the wedge obtained into a square, such as the one presented in the picture below:
Fig. 9. Intra-diagonal one-fourth of tile 4
Fig. 10. Square with transformed image
The image above keeps the same information as the wedge with one-fourth of the original tile, but allows a better compression in JPEG, reducing the image to 169 KB, 26% of the size of the original file.
3.3 Tiles 2s4r-Symmetrical Tile 5 (Figure 5) is an example of a tile symmetrical in relation to the diagonal and the median. Its original size under JPEG compression is 97 KB. Isolating the basic pattern, one obtains the figure below, of size 15 KB if compressed in JPEG, almost 15% of the original size:
Fig. 11. Half of intra-diagonal one-fourth of tile 4
Fig. 12. Rectangle with transformed image
Using the same transformation as performed in the case of diagonally symmetrical tiles, one obtains an image such as the one shown in Figure 12, corresponding to a JPEG file of the same size but with a smaller number of pixels.
4 Working Colours Very frequently tiles are monochromatic in “cobalt blue”. From the 18th century onwards, other colours started to become more popular in tiles. The manufacturing
process of tiles paints one colour at a time. In general, the colours are solid and the texture is uniform. Several experiments were performed with different tiles to analyse which compression scheme would produce the best results. The conclusion reached was that JPEG compression of the direct true-colour image provided the best results in terms of compression. Whenever an attempt was made to reduce the number of colours used and then perform either a JPEG or a TIFF compression, the result was a file larger than the original JPEG one.
4.1 Non-overlapping Patterns Observing a pattern such as the one of Tile 5, shown in Figure 12 above, one sees that the regions of different colours do not overlap or merge. This allows transforming the image into a monochromatic one and then using the TIFF format to store the resulting pattern, which is presented in Figure 13 below:
Fig. 13. Monochromatic pattern of Figure 12
The pattern shown in Figure 13 yielded a TIFF file of size 3 KB, just over 3% of the original JPEG file. Together with the image above, the kind of pattern and the predominant colour of each region need to be stored, to allow colourising the pattern and reassembling the original tile image. The colourising method is a simple flood-fill process [2], which can be performed very efficiently.
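A sketch of this colourising step: the monochromatic “seed” pattern splits the tile into connected background regions, each of which is filled with its stored predominant colour. Connected-component labelling is used here in place of an explicit pixel-by-pixel flood fill, and the colour table is assumed to list one RGB triple per region in labelling order.

```python
import numpy as np
from scipy.ndimage import label

def colourise_pattern(pattern, region_colours, line_colour=(0, 0, 0)):
    """Fill each region delimited by the binary pattern with its stored colour.

    pattern        : 2-D boolean array, True on the drawing (contour) pixels
    region_colours : list of (r, g, b) tuples, one per connected region
    """
    regions, n = label(~pattern)                 # connected regions between the lines
    out = np.zeros(pattern.shape + (3,), dtype=np.uint8)
    out[pattern] = line_colour
    for i in range(1, n + 1):
        colour = region_colours[(i - 1) % len(region_colours)]
        out[regions == i] = colour
    return out
```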
4.2 Overlapping Patterns In the case where the pattern merges and overlaps colours, such as in Tile 4 (Figure 4), one can filter the colours into different “masks” of non-overlapping monochromatic patterns and then compress them in TIFF for storage or network transmission. Adopting this strategy, one obtains two monochromatic files with a combined size of 15 KB, less than 3% of the original JPEG file. Notice that the production of these masks needs to be performed only once and that this process can easily be automated. Another possibility is to introduce borders separating the overlapping patterns. Such a border would act as a flood-fill barrier and could be as thin as 1 pixel wide.
4.3 Adding Texture The colour images obtained by the flood-fill process in the different regions of the tile image, or “seed” patterns, that will generate the tile images look artificially unpleasant. The addition of texture removes this drawback. Two techniques are being analysed. The first one consists in adding standard “error” matrices to each RGB component of the
original image, introducing different hues, just before visualisation. A smooth low-pass filtering of the error-added image may provide a more “natural look” to the synthetically generated tile image. The second approach to adding texture is more labour intensive. Preliminary experiments have shown that regions seem to have a “Gaussian distribution” of hues centred on the most frequent colour of the region. In this technique, together with the most frequent colour of each region, one stores the parameters of the hue distributions. This information is used to colourise the different regions of the “seed” patterns.
5 Conclusions and Lines for Further Work This paper presents a way to compress images of ceramic tiles that makes their network transmission and storage more efficient. The same approach may also be used for other materials, such as textiles and carpets, due to the similarity of their prints. The pattern classification, now performed by the webmaster, may later be performed automatically. Modern tiles are produced by machines and pattern variation is minimal, making automatic recognition of the symmetry pattern easier. On the other hand, historical tiles were handmade and patterns vary not only between pieces, but also within a piece. Criteria are being developed for automatic pattern classification. As already mentioned, the images presented herein were obtained by scanning the most beautiful tile images presented in reference [1]. This process, however, has the drawback of providing non-uniform final image resolution, as the images presented in [1] vary in size. Thus, it is under consideration to obtain new images by directly photographing those historical tiles digitally, using the pioneering work of the late researcher António Menezes e Cruz, extended by Silvia Tigre Cavalcanti and reported in reference [1], as a guidebook. One fundamental issue is also being addressed: what is the appropriate (minimal) resolution images should have in order to provide enough detail to observe the beauty of the tiles at the lowest storage and network transmission costs. The technology presented is being used in assembling a Virtual Museum of the Tile in Pernambuco, currently under construction, which should in its first phase exhibit the tiles presented in reference [1], being later extended to encompass the different historical tiles of the region. Its web site is http://www.museudoazulejo.ufpe.br. Acknowledgements. The research reported herein was partly sponsored by CNPq - Conselho Nacional de Desenvolvimento Científico e Tecnológico of the Brazilian Government, to whom the author expresses his gratitude.
References
1. S. Cavalcanti and A. Menezes e Cruz. Tiles in the Secular Architecture of Pernambuco, 19th Century. Metalivros, 2002.
2. M. Berger. Computer Graphics with Pascal. Addison-Wesley, 1986.
3. R.D. Lins and D.S.A. Machado. A Comparative Study of File Formats for Image Storage and Transmission. Journal of Electronic Imaging, vol. 13(1), pp. 175-183, Jan. 2004.
4. J. Miano. Compressed Image File Formats: JPEG, PNG, GIF, XBM, BMP. Addison Wesley Longman, Inc., 1999.
5. J.D. Murray, D. James and W. vanRyper. Encyclopedia of Graphics File Formats. O'Reilly & Associates, Inc., 1996.
6. G.K. Wallace. The JPEG Still Picture Compression Standard. Communications of the ACM, vol. 34, no. 4, April 1991, pp. 31-44.
7. E. Hamilton. JPEG File Interchange Format, version 1.02. C-Cube Microsystems, September 1992.
8. TIFF, Revision 6.0 Final. Aldus Corporation, June 3, 1992.
Projection Profile Based Algorithm for Slant Removal

Moisés Pastor, Alejandro Toselli, and Enrique Vidal

Institut Tecnològic d’Informàtica, Departament de Sistemes Informàtics i Computació, Universitat Politècnica de València, Camí de Vera s/n, 46071-València, Spain
{moises,ahector,evidal}@iti.upv.es
Abstract. Slant is one of the main sources of handwritten text variability. The slant is the clockwise angle between the vertical direction and the vertical text strokes. A well-formalised and fast method to estimate the slant angle is presented. The method is based on the observation that the column distribution of the vertical projection profile presents maximum variance for non-slanted text. A comparison with the Sobel-operator convolution method and the Idiap slant method is provided.
1 Introduction Writers exhibit different writing styles, and even a single writer usually presents different styles depending on his or her mood and writing speed. Current image models cannot capture such text variability. Normalisation is needed to make the writing style as uniform as possible. Although there is no general, standard solution for style normalisation, there are some well-known preprocessing steps that are common to almost every system: slope and slant removal, and size normalisation. The slope is the angle between the horizontal direction and the direction of the line on which the writer aligned the word. The slope removal preprocess tries to correct this angle, obtaining text aligned with the horizontal direction. The slant is the clockwise angle between the vertical direction and the dominant direction of the vertical strokes. The aim of slant removal is to make the text invariant with respect to the slant angle, while size normalisation tries to make the system invariant to the size of the characters. In this work a method for slant correction is presented. The same method can be used as a technique for slope angle estimation, as demonstrated in [3,8]. To obtain the dominant angle, most methods calculate the average angle of close-to-vertical strokes. The problem here is how to choose the close-to-vertical strokes and how much influence each stroke has in the angle calculation. Many methods have been developed for slant angle estimation, such as those based on image convolution using Sobel edge operators. Typically, these methods convolve the image with vertical and horizontal Sobel kernels, and the gradient phase angle is calculated for every point of the image. A histogram of all angles is computed. In order to obtain the relevant angles (those close to the vertical direction), a triangular or Gaussian filter, centred at 90 degrees, is applied to the histogram. The mean (or the most frequent) angle of the histogram is taken as the slant angle [9,5].
Other methods are based on structural information. After thinning the text, it is encoded using a conventional 8-directional chain code, and the strokes close to the vertical direction are selected. The average of the angles of those strokes, weighted by their length, is taken as the slant angle [10]. A similar method is used in [4], where the vertically oriented contour is computed by taking only the horizontal black-white and white-black transition pixels into account. The angle distribution of the writing's contour is accumulated in an angle histogram, and to reduce the influence of horizontal contour lines only nearly vertical parts are considered. Other methods are based on the vertical projection profile. The text is first sheared for a discrete number of angles around the vertical orientation; for each of these images the vertical projection profile is calculated, and the profile giving the maximum variation is taken as the profile corresponding to the deslanted image. In [3] the Wigner-Ville Distribution is used to calculate the degree of variation through the different vertical projection profiles. Another method based on the vertical projection profile can be found in [8]. In order to select the columns belonging to vertical strokes, a normalised vertical projection profile is built, where each column of the vertical projection is divided by the distance between the first and last black point in that column of the original image. Columns of the normalised vertical projection profile whose value is one are considered to belong to vertical strokes. For each unnormalised vertical projection profile the sum of the squared selected columns is computed, and the profile with the greatest sum is taken as the profile of the deslanted text (we will refer to this method as the Idiap slant method). In this paper, a simple method for slant correction is presented which also follows the vertical projection profile approach. It calculates the main slant angle of the text or, if there is more than one dominant angle, it computes a "centre of mass" angle which summarises the most important of these angles. To choose the maximum-variation projection profile, most of the methods discussed above use complicated functions or are based on heuristics; the one presented here is simply based on the standard deviation of the projection profile histogram. The proposed slant method is explained in the next section, the third section presents the databases used to test it, the fourth section describes the system used in the experiments, and the fifth section summarises the results.
2 Slant Angle Estimation The method is based on the observation that the distribution of the vertical projection profile histogram presents a maximum variation for the deslanted text (see Fig. 1). This is mainly due to the effect of ascenders and descenders, and also to the effect of every vertical component of the text. The text image is sheared for a discrete number of angles (typically from -45 to 45 degrees with a 1 degree step) with respect to the vertical direction. Assuming that the slant angle of the original image is α, the image sheared by -α produces non-slanted text; this holds as long as α is less than or equal to the maximum angle used in the shearing process (usually 45 degrees). To discriminate this deslanted image from the rest of the sheared images, the vertical projection profile for every sheared image and the corresponding vertical projection histogram are built. The vertical projection
histogram which presents the maximum variability is chosen as corresponding to the deslanted text. More formally, let σ(I, α) be the standard deviation of the vertical projection profile histogram obtained by shearing the image I by an angle α. The angle that maximises σ(I, α) is taken as the opposite (the negative) of the image slant angle.
A representation of the distribution of the standard deviation for the Spanish word ochocientos (eight hundred), previously slanted to -20 degrees (the same word that appears in Fig. 1), can be seen in Figure 2, left. Unfortunately, it is quite usual to find words with more than one dominant slant angle; this means that there are dominant word strokes written with different angles. In these cases, the most dominant angle might not be the best representative of the overall slant of the word; Figure 2, right, shows a word with two dominant angles. In these cases some kind of smoothing is necessary. A simple technique is to take into account all angles whose standard deviation is close enough to the greatest one, σ_max, where closeness is measured as a fraction f of this maximum. The resulting overall slant angle is computed as the centre of mass of those angles whose standard deviation is greater than or equal to f·σ_max:
α* = ( Σ_{α∈R} α · σ(I, α) ) / ( Σ_{α∈R} σ(I, α) ), where R is the set of angles α for which σ(I, α) ≥ f·σ_max. Image shearing is an expensive, time-consuming process. In order to save computation, the image is not actually sheared; instead, all vertical projection profiles are computed directly. The algorithm is based on the following strategy:
A matrix is built with one row per shear angle, each row storing the corresponding vertical projection profile. For each row of the original image, all column displacements (one per angle) are calculated, and for each black pixel on the current row the target column of each corresponding projection profile is incremented.
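As an illustration of this strategy, a minimal Python sketch follows. It is an independent reimplementation, not the authors' code: it assumes a binary NumPy image with foreground pixels equal to 1, it takes the standard deviation of each projection profile directly (rather than of a separate histogram of its values), and it assumes the centre of mass weights each candidate angle by its standard deviation.

```python
import numpy as np

def estimate_slant(img, angles=np.arange(-45, 46), f=0.98):
    """Estimate the slant of a binary text image (foreground = 1).

    For every candidate shear angle the vertical projection profile is
    accumulated directly, without shearing the image: each foreground pixel
    in row y contributes to column x + y*tan(angle) of that angle's profile.
    """
    h, w = img.shape
    ys, xs = np.nonzero(img)                         # coordinates of foreground pixels
    margin = int(np.ceil(h * np.tan(np.radians(np.max(np.abs(angles))))))
    profiles = np.zeros((len(angles), w + 2 * margin))
    for i, a in enumerate(angles):
        cols = (xs + ys * np.tan(np.radians(a)) + margin).astype(int)
        np.add.at(profiles[i], cols, 1)              # increment displaced columns
    stds = profiles.std(axis=1)                      # dispersion of each profile
    sel = stds >= f * stds.max()                     # angles close enough to the best one
    shear = np.sum(angles[sel] * stds[sel]) / np.sum(stds[sel])   # centre of mass
    return -shear   # slant is the opposite of the best shear (sign depends on image axes)
```

For example, calling estimate_slant on the word image of Fig. 1 slanted by -20 degrees should return a value close to -20, since the maximum dispersion occurs for a shear of about 20 degrees.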
3 Handwritten Text Image Corpora In order to test the effectiveness of the proposed deslanting technique, two different handwriting recognition tasks are considered.
3.1 Spanish Numbers Task The Spanish numbers handwritten corpus [7,1] is composed of numeric quantities written in words by 29 writers, each one writing 17 numbers. This is the kind of data typically found in bank-check legal amount recognition tasks. The total amount of data is 493 phrases and 2127 words, with a vocabulary of 52 different words. The corpus was split into two sub-corpora: the test sub-corpus consists of 187 sentences, while the training sub-corpus has 298 sentences.
3.2 Odec Task In this case we consider a handwritten text recognition task entailing casual, spontaneous writing and a relatively large vocabulary. The application consists of recognising casual handwritten answers extracted from survey forms made for a telecommunications company¹. These answers were handwritten by a heterogeneous group of people, without any explicit or formal restriction on vocabulary. Since no guidelines were given as to the kind of pen or the writing style to be used, the paragraphs are very variable and noisy. Some of the difficulties involved in this task are worth mentioning. In some samples the stroke thickness is non-uniform, and other samples present irregular and inconsistent spacing between words and characters. There are also words written with different case, font types and sizes within the same word, and some samples include foreign-language words or sentences. In addition, noise and non-textual artifacts often appear in the paragraphs: unknown words or words containing orthographic mistakes, underlined and crossed-out words, unusual abbreviations, symbols, arrows, etc. The combination of these writing styles and noise may result in partly or entirely illegible samples and makes preprocessing, and deslanting in particular, quite challenging. The image data set extracted from the survey forms consists of 913 binary images of handwritten paragraphs containing 16,371 words, with a vocabulary of 3,308 different words. The resulting set was partitioned into a training set of 676 images and a test set including the 237 remaining images.
¹ Data kindly provided by ODEC, S.A. (www.odec.es)
Fig. 1. Text slanted at different angles along with its associated vertical projection and corresponding histogram. The histogram for the less slanted text exhibits a greater dispersion. The standard deviations of these histograms are 6.82, 9.04, 12.04, 7.56 and 6.33 for slant angles of 40, 20, 0, -20 and -40 degrees, respectively.
Fig. 2. The left panel shows the distribution of standard deviations for all vertical projection profiles corresponding to the word ochocientos (eight hundred) slanted by -20 degrees, the image that appears in Figure 1. It can be seen that the maximum standard deviation corresponds to 20 degrees. The right panel shows the standard deviations for the word quinientos (five hundred); this word exhibits two dominant slant angles.
4 The Continuous Handwriting System The system used here is composed of three modules: preprocessing, feature extraction and recognition (see [7]). The preprocessing consists of skew correction, slant correction and size normalisation. The original slant correction method used in this system was based on image convolution using Sobel edge operators; the best result obtained with it is used as a baseline (see Table 1). Here, the slant module is replaced by the one presented in Section 2, and also by our implementation of the Idiap slant method. Each preprocessed text sentence image is represented as a sequence of feature vectors. To do this, the feature extraction module applies a grid to divide the image into N × M square cells; in this work, N = 20 is adopted. For each cell, three features are calculated: normalised gray level, horizontal gray level derivative and vertical gray level derivative. To obtain smoothed values of these features, feature extraction is not restricted to the cell under analysis but extended to a 5 × 5 cell window centred at the current cell. To compute the normalised gray level, the analysis window is smoothed by convolution with a 2-D Gaussian filter. The horizontal derivative is calculated as the slope of the line which best fits the horizontal function of column-average gray level; the fitting criterion is the sum of squared errors weighted by a 1-D Gaussian filter which enhances the role of the central pixels of the window under analysis. The vertical derivative is computed in a similar way. At the end of this process, a sequence of M vectors, each with 60 dimensions (20 normalised gray levels, 20 horizontal derivatives and 20 vertical derivatives), is obtained [7]. The character models used in this work are continuous-density left-to-right Hidden Markov Models with 6 states per model, each state with a mixture of 16 Gaussian densities for the Spanish numbers task and 64 Gaussian densities for the Odec task. The numbers of Gaussian densities and of states were chosen empirically after tuning the system. It should be noted that the number of Gaussians and the number of states define the amount of parameters to be estimated, and they strongly depend on the number
of training vectors available. The character models are trained using the Baum-Welch algorithm, while the Viterbi algorithm is used for decoding. The total numbers of different character models for the Spanish numbers task and the Odec task are 19 and 80, respectively. For the first task, a finite-state network is used as the language model. The language of Spanish written numbers follows a formal syntax; therefore, a finite-state automaton was constructed by hand which covers all written Spanish numbers from 0 up to the maximum value of the task. Unlike the Spanish numbers application, the Odec task uses n-grams as language models in the recognition process. They model the concatenation of the words of each sentence [2], using the previous words to predict the current one; that is,
p(w_1, ..., w_N) ≈ Π_i p(w_i | w_{i-n+1}, ..., w_{i-1}), where w_1, ..., w_N is a sequence of words. N-grams can easily be represented by finite-state deterministic automata and can be learned by maximum likelihood from a training (text) corpus, simply by counting relative frequencies of word sequences in the corpus [2]. For this task the best results were obtained using bi-grams, with Witten-Bell back-off smoothing [11,12], trained using the transcriptions of the training set.
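As an aside, the relative-frequency estimation mentioned above is straightforward to implement. The sketch below builds a plain, unsmoothed bigram model from a tokenised corpus; it is illustrative only, and the Witten-Bell back-off smoothing actually used in the experiments is omitted.

```python
from collections import Counter, defaultdict

def train_bigram(sentences):
    """Estimate p(w_i | w_{i-1}) by counting relative frequencies.

    `sentences` is an iterable of word lists; <s> and </s> mark sentence
    boundaries. No smoothing: unseen bigrams get probability zero.
    """
    unigram, bigram = Counter(), defaultdict(Counter)
    for words in sentences:
        tokens = ["<s>"] + list(words) + ["</s>"]
        for prev, cur in zip(tokens, tokens[1:]):
            unigram[prev] += 1
            bigram[prev][cur] += 1
    def prob(cur, prev):
        return bigram[prev][cur] / unigram[prev] if unigram[prev] else 0.0
    return prob

# Toy example with Spanish number phrases: p("mil" | "dos") = 2/3.
p = train_bigram([["dos", "mil"], ["dos", "mil", "tres"], ["dos", "cientos"]])
print(p("mil", "dos"))
```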
5 Experimental Results Slant normalisation methods cannot be evaluated in isolation; therefore, the effectiveness of the slant correction is assessed through the performance of a handwritten text recognition system. The performance figure used here is the Word Error Rate (WER), calculated as:
WER = 100 · (I + S + D) / (C + S + D), where I, S, D and C are the numbers of insertions, substitutions, deletions and correctly recognised words, respectively. The performance of the system is shown in Table 1. Results are calculated for several values of the fraction f (see Section 2). The result labelled Sob is the best value obtained using the image-convolution, Sobel edge operator technique, and the result labelled Idiap is the performance obtained using the Idiap slant method. The best result is obtained by taking into account only the angles whose standard deviation is greater than or equal to 98% of the maximum for the Spanish numbers task, and 94% for the Odec task. It should be said that the original baseline for the Odec task was 54.3% WER [6]; the baseline presented in this work was achieved after improving some other preprocessing steps. The productivity of the method is measured as the number of phrases that it can process per second. On an Intel 1.5 GHz Pentium IV computer, the implementation of the algorithm described in Section 2 processes about 125 characters per second, which corresponds to approximately 20 words and 4.5 sentences per second. This measure includes both slant estimation and slant correction.
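The WER defined above can be obtained from the edit-distance alignment between the recognised and reference transcriptions. A minimal sketch (illustrative, not the evaluation tool used by the authors):

```python
def word_error_rate(reference, hypothesis):
    """WER = 100 * (I + S + D) / (C + S + D), via dynamic programming."""
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                                  # deletions only
    for j in range(len(hyp) + 1):
        d[0][j] = j                                  # insertions only
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,           # deletion
                          d[i][j - 1] + 1,           # insertion
                          d[i - 1][j - 1] + sub)     # match / substitution
    return 100.0 * d[len(ref)][len(hyp)] / len(ref)  # len(ref) = C + S + D

print(word_error_rate("dos mil tres", "dos mil siete"))   # one substitution: 33.3%
```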
6 Conclusions A simple, well-formalised and fast method for slant angle estimation has been presented. The method calculates the dominant slant angle or, if there is more than one, a weighted average of the main dominant angles. For the corpora used in this work, the experiments show that the best performance is obtained when not only the most dominant slant angle is used. Moreover, the method improves on the results obtained using Sobel-operator convolution or the Idiap slant method.
References
1. J. González, I. Salvador, A.H. Toselli, A. Juan, E. Vidal, and F. Casacuberta. "Off-Line Recognition of Syntax-Constrained Cursive Handwritten Text." In Proc. of the Joint IAPR Int. Workshops SSPR 2000 and SPR 2000, volume 1876 of Lecture Notes in Computer Science, pages 143-153, Springer-Verlag, 2000.
2. F. Jelinek. "Statistical Methods for Speech Recognition." MIT Press, 1998.
3. E. Kavallieratou, N. Dromazou, N. Fakotakis and G. Kokkinakis. "An Integrated System for Handwritten Document Image Processing." International Journal of Pattern Recognition and Artificial Intelligence, Vol. 17, No. 4, pp. 617-636, 2003.
4. U. Marti and H. Bunke. "Using a Statistical Language Model to Improve the Performance of an HMM-Based Cursive Handwriting Recognition System." International Journal of Pattern Recognition and Artificial Intelligence, Vol. 15, No. 1, pp. 65-90, 2001.
5. Changming Sun and Deyi Si. "Skew and Slant Correction for Document Images Using Gradient Direction." 4th International Conference on Document Analysis and Recognition, pp. 142-146, 1997.
6. A. Toselli, A. Juan and E. Vidal. "Spontaneous Handwriting Recognition and Classification." Proceedings of the 17th International Conference on Pattern Recognition, 2004. Accepted.
7. A. Toselli, A. Juan, D. Keysers, J. González, I. Salvador, H. Ney, E. Vidal and F. Casacuberta. "Integrated Handwriting Recognition and Interpretation Using Finite-State Models." International Journal of Pattern Recognition and Artificial Intelligence, 2003. To appear.
8. A. Vinciarelli and J. Luettin. "A New Normalization Technique for Cursive Handwritten Words." Pattern Recognition Letters, Vol. 22, No. 9, pp. 1043-1050, 2001.
9. B. Yanikoglu and P. Sandon. "Segmentation of Off-Line Cursive Handwriting Using Linear Programming." Pattern Recognition, Vol. 31, pp. 1825-1833, 1998.
10. Daekeun You and Gyeonghwan Kim. "Slant Correction of Handwritten Strings Based on Structural Properties of Korean Characters." Pattern Recognition Letters, No. 12, pp. 2093-2101, 2003.
11. Slava M. Katz. "Estimation of Probabilities from Sparse Data for the Language Model Component of a Speech Recognizer." IEEE Trans. on Acoustics, Speech and Signal Processing, pp. 400-401, 1987.
12. Ian H. Witten and Timothy C. Bell. "The Zero-Frequency Problem: Estimating the Probabilities of Novel Events in Adaptive Text Compression." IEEE Trans. on Information Theory, Vol. 17, No. 4, July 1991.
Novel Adaptive Filtering for Salt-and-Pepper Noise Removal from Binary Document Images Amr R. Abdel-Dayem, Ali K. Hamou, and Mahmoud R. El-Sakka Computer Science Department, University of Western Ontario London, Ontario, Canada {amr,ahamou,elsakka}@csd.uwo.ca
Abstract. Noise removal from binary document and graphic images plays a vital role in the success of various applications. These applications include optical character recognition, content-based image retrieval and hand-written recognition systems. In this paper, we present a novel adaptive scheme for noise removal from binary images. The proposed scheme is based on connected component analysis. Simulations over a set of binary images corrupted by 5%, 10% and 15% salt-and-pepper noise showed that this technique reduces the presence of this noise, while preserving fine thread lines that may be removed by other techniques (such as median and morphological filters).
1 Introduction When scanning a paper document, the resulting image may contain noise caused by dirt on the page during the acquisition process. This noise may hinder the translation of the image into ASCII characters when performing Optical Character Recognition (OCR), and it may also degrade an image such that key characteristics are damaged. There are various classes of document noise that can corrupt images, including salient noise [1], white line dropout noise [1], shadowing noise [2], and salt-and-pepper noise [3]. This study focuses mainly on salt-and-pepper noise in binarized document and graphic images. Salt-and-pepper noise is characterized by black dots on a white background and white dots on a black background; each noise dot can be either an isolated pixel or composed of several pixels. Noise in binarized images can be eliminated using various techniques, including noise modeling [4] and filtering [5]. Noise modeling can be used to formulate the noise within an image in order to remove it; however, prior information about the noise must be available to do so. Image filtering can be grouped into two main categories: linear and nonlinear filtering. Linear filters are mathematically simple, but they generally cause distortions at edges; hence many researchers have sought ways to design nonlinear filters that help maintain structural information. Since this paper deals with nonlinear filtering, a short summary of the two major nonlinear filtering techniques for binarized noisy images, morphological and median filtering, is provided below.
Morphology is a methodology for studying image structure using set theory [6]. The two basic operators used for all morphological operations are erosion and dilation. These two operators can be used to remove salt-and-pepper noise by applying erosion followed by dilation, known as opening, and their inverse, known as closing, over the entire image. Morphology may damage the edge integrity of both text and graphic components when large structuring elements are used. Several researchers have developed modified morphological techniques to improve the removal of noise from binary images [7][8][9]; however, these still suffer from the performance deficiencies of morphological filters stated above. Median filters have often been shown to offer good performance on impulse noise, but they tend to suffer from streak noise at higher noise rates [10]. Another drawback is that thread lines and corners may be removed, since such objects are isolated, causing the filter to treat such features as noise. Hence the cost of noise suppression on binary images may be degraded edge features. Many techniques have been created to overcome these drawbacks, producing a modified set of filters known as adaptive median filters. These include center-weighted filtering [11], vector median filtering [12], image continuation algorithms [13], and rank-ordered filtering [14]. These adaptive median filters still employ standard median filter techniques throughout the entire image and hence suffer from the disadvantages stated above. The two major filtering techniques for binarized document and graphic images forfeit the details of fine thread lines and corners when applied to an entire image. Hence, in this paper, an adaptive filtering algorithm is presented that provides low time complexity with adequate functionality to remove salt-and-pepper noise without damaging the document image. This algorithm maintains the integrity of thread lines and close-proximity pixels, and pixel connectivity is used as a yardstick to provide a proper estimation of true edges. The paper is organized as follows. In Section 2 the proposed algorithm is outlined. Section 3 contains the experimental setup, followed by the results in Section 4. Finally, conclusions are given in Section 5.
2 The Proposed Method We propose a novel method for noise reduction in binary document images. The idea is based on the hypothesis that random noise usually appears as isolated pixels or as small groups of clustered pixels. A graph-theoretical approach is used to cluster the image pixels into two classes, namely the isolated-pixel and the non-isolated-pixel classes; the graph connectedness property is used as the main feature during the classification process. We define an object as a group of connected pixels that have the same color. Thus, in theory, the image can contain objects with sizes ranging from one pixel to the total number of pixels in the image. The algorithm uses a user-specified parameter called connected-component-size, which is used as a threshold to classify image objects as isolated or non-isolated. In setting this parameter, there is a trade-off between noise removal and fine detail preservation.
The judgment is left to the user to tune this trade-off based on the image at hand. The algorithm uses the depth-first-search technique [15] to scan the eight-connected component of a pixel up to a certain size (connected-component-size). Starting from a pixel P_ij, the algorithm searches through a 2L×2L grid centred at P_ij, where L is the connected-component-size parameter specified by the user. A module called check_connected_component is used to check whether pixel P_ij belongs to a connected object with size greater than the connected-component-size parameter or not; Fig. 1 shows a detailed description of this module. For every pixel in the image, the check_connected_component module is invoked to label the pixel as isolated or non-isolated. Isolated pixels have a high probability of being noisy pixels, so their values can be reconstructed by any noise removal method found in the literature. As a proof of concept, we simply used an n×n median filter to remove this noise, where n is another user-specified parameter. The reconstructed pixel value is copied to an output buffer. Non-isolated pixels have a high probability of being uncorrupted, so they are copied directly to the output buffer without any change. Note that the use of the output buffer allows our scheme to accomplish its task in a single pass through the image. It is worth mentioning that the time complexity of our scheme is in the order of W×H, where W and H are the image width and height, respectively. For each pixel, the algorithm searches through a grid of dimension 2L×2L, where L is the connected-component-size parameter specified by the user. Usually the value of this parameter is small compared to the image dimensions, which keeps the algorithm complexity within the boundary stated earlier. It was empirically found that the majority of noisy pixels in document images are clustered into small components; thus, setting the connected-component-size parameter to three and the median filter window to 3×3 yielded good results.
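A compact Python sketch of this scheme follows. It is an illustrative reimplementation rather than the authors' code (the actual check_connected_component module is given in Fig. 1, which is not reproduced here); it assumes a NumPy array of 0/1 values, L = 3 and a 3×3 median window.

```python
import numpy as np

def is_isolated(img, y, x, L=3):
    """Depth-first search in a window of roughly 2L x 2L centred at (y, x):
    True if the 8-connected same-colour component of (y, x), restricted to
    the window, contains at most L pixels."""
    h, w = img.shape
    colour = img[y, x]
    y0, y1 = max(0, y - L), min(h, y + L)
    x0, x1 = max(0, x - L), min(w, x + L)
    stack, seen = [(y, x)], {(y, x)}
    while stack:
        cy, cx = stack.pop()
        for dy in (-1, 0, 1):
            for dx in (-1, 0, 1):
                ny, nx = cy + dy, cx + dx
                if (y0 <= ny < y1 and x0 <= nx < x1 and (ny, nx) not in seen
                        and img[ny, nx] == colour):
                    seen.add((ny, nx))
                    stack.append((ny, nx))
                    if len(seen) > L:        # component is large enough: not isolated
                        return False
    return True

def adaptive_filter(img, L=3, n=3):
    """Single pass: isolated pixels are replaced by the n x n median of the
    input image; non-isolated pixels are copied unchanged to the output buffer."""
    h, w = img.shape
    out = img.copy()
    for y in range(h):
        for x in range(w):
            if is_isolated(img, y, x, L):
                y0, x0 = max(0, y - n // 2), max(0, x - n // 2)
                out[y, x] = np.median(img[y0:y + n // 2 + 1, x0:x + n // 2 + 1])
    return out
```

Because the output is written to a separate buffer, later decisions are never influenced by pixels that have already been filtered, which matches the single-pass behaviour described above.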
3 Experimental Setup To test our scheme, we generated binary images with features that typically exist in most document images, such as characters, numbers, symbols and lines. Fig. 2(a) shows one of these generated test images, which contains two character sets. Each set was written with a different font (16-point Gothic Book and 16-point Times New Roman); the former font has a one-pixel stroke width, while the latter has a variable stroke width. Each character set is surrounded by a rectangular (or round-rectangular) frame. These frames are added to represent drawing objects that usually exist in flowcharts, block diagrams and computer drawings. We intentionally kept the distance between the two rectangles at one pixel in order to test the behavior of our scheme when dealing with fine details.
Fig. 1. The check_connected_component module.
The generated image is corrupted by adding random salt-and-pepper noise. The noise is added with different densities to check the robustness of our scheme. Fig. 2(b-d) shows the image after adding 5%, 10% and 15% salt-and-pepper noise, respectively.
Fig. 2. The generated test images (a) The original image. (b-d) The image in (a) after adding random salt-and-pepper noise with different densities: 5%, 10% and 15%, respectively.
Finally, we compare our scheme with some of the common methods found in the literature. Note that we assume no prior knowledge about the noise distribution; hence, we cannot use adaptive median filters, which change the filter size based on the noise probability distribution in the surrounding block. Centre-weighted median filters, in turn, are based on assigning more weight to the central pixel. In our case we use a black-and-white image, where each pixel has a probability of 0.5 of being either black or white, so assigning more weight to the central pixel would increase the probability of misclassifying a noisy pixel; we believe that in our case the standard median filter will outperform the centre-weighted one. In order to minimize the distortion in the reconstructed image, a small filter size is desirable; thus we used a 3×3 median filter. We also include opening and closing morphological filters in the comparison. The output of the morphological filters depends on the order in which the opening and closing operations are applied, as well as on the shape of the structuring element. The opening operation removes the salt noise, while the closing operation removes the pepper noise. Generally, smaller structuring elements minimize character distortion; thus, in this study, we used a 2×2 structuring element for the opening and closing operations. Using a 2×2 structuring element causes a one-pixel shift in the resulting image; however, this shift is irrelevant during the visual inspection phase.
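For reference, these baseline filters can be reproduced with standard routines. The following sketch uses SciPy and assumes a 0/1 NumPy image; whether ink is 0 or 1 only affects which of opening and closing removes salt or pepper.

```python
import numpy as np
from scipy import ndimage

def median_baseline(img, size=3):
    # 3x3 median filter over the whole binary image.
    return ndimage.median_filter(img, size=size)

def open_close_baseline(img, size=2):
    # Morphological opening followed by closing with a 2x2 structuring element.
    selem = np.ones((size, size), dtype=bool)
    opened = ndimage.binary_opening(img.astype(bool), structure=selem)
    return ndimage.binary_closing(opened, structure=selem).astype(img.dtype)

def close_open_baseline(img, size=2):
    # Morphological closing followed by opening, the second baseline variant.
    selem = np.ones((size, size), dtype=bool)
    closed = ndimage.binary_closing(img.astype(bool), structure=selem)
    return ndimage.binary_opening(closed, structure=selem).astype(img.dtype)
```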
4 Results We applied the proposed scheme, the median filter, and the opening-closing and closing-opening morphological filters, as shown in Fig. 3, to remove the salt-and-pepper noise from the 5% corrupted image shown in Fig. 2(b). Fig. 3(a) shows the result of the proposed scheme, where the majority of the salt-and-pepper noise was removed while the shape distortion of the objects in the image was minimized. To illustrate the remaining distortion, consider the character 'H' in the first line. Returning to Fig. 2(b), part of the letter 'H' is covered with salt noise (white pixels). This causes the algorithm to label this part as isolated pixels; it is therefore considered noise and removed by the median filter. This situation can be resolved by trying different values of the connected-component-size parameter. At the same time, both the median filter and the morphological filters fail to produce readable characters, especially for the first character set, and they also remove the two rectangles surrounding the text. The line in the middle of Fig. 3(b) results from the inversion of the white line that exists between the borders of the two rectangles in the original image. The thick line in the middle of Fig. 3(c) results from the opening operation, which fills the gap between the two original lines. This line is completely removed in Fig. 3(d) due to the closing operation, which wipes out the two parallel black lines.
Fig. 3. The 5% noisy image shown in Fig. 2(b) after applying (a) the proposed method, (b) a 3×3 median filter, (c) a 2×2 morphological opening followed by a closing operation, and (d) a 2×2 morphological closing followed by an opening operation.
Fig. 4. The reconstructed images after applying the proposed method: (a) the image with 10% salt-and-pepper noise. (b) the image with 15% salt-and-pepper noise.
Since both the median filter and the morphological filters fail to produce adequate results on the 5% noisy image, they will certainly fail at higher noise levels; thus, for higher noise levels, we only present the results of our scheme. Fig. 4 shows the output of our scheme when applied to the 10% and 15% noisy images shown in Fig. 2(c,d). The proposed scheme still produces good results even in the presence of these higher noise levels, which demonstrates its power in removing salt-and-pepper noise.
5 Conclusions In this paper, we proposed a novel mechanism for removing salt-and-pepper noise from binary document images. The scheme is based on connected component analysis. Experimental results showed that our scheme outperforms the median and morphological filters, while its time complexity is of the same order as that of the median filter. One of the major features of our scheme is that no prior knowledge about the statistical model of the noise is required, and the user, depending on the image at hand, can balance noise removal against the preservation of fine details.
Acknowledgements. This research is partially funded by the Natural Sciences and Engineering Research Council of Canada (NSERC). This support is greatly appreciated.
References 1.
2.
3.
4. 5. 6. 7.
8.
9.
10. 11.
12.
13. 14.
15.
Yoon, M., Lee, S., Kim, J.: Faxed image restoration using Kalman filtering. Third International Conference on Document Analysis and Recognition. Vol. 2, (1995) 677-680. Ping, Z., Lihui, C., Alex, K.C.: Text Document Filters Using Morphological and Geometrical Features of Characters. 5th International Conference on Signal Processing, Vol. 1, (2000) 472 – 475. Chinnasarn, K., Rangsanseri, Y., Thitimajshima, P.: Removing Salt-and-pepper Noise in Text/Graphics Images. IEEE Asia-Pacific Conference on Circuits and Systems. (1998) 459-462. Gonzalez, R.C., Woods, R.E.: Digital Image Processing, Second Edition. Prentice Hall Inc. (2002) 220 – 276. Schroeder, J., Chitre, M.: Adaptive mean/median filtering. Proceedings of the Asilomar Conference on Signals, Systems and Computers, Vol. 1, (1996) 13-16. Dargherty, E., Lotufo, R.: Hands–on Morphological Image Processing. The society of Photo-Optical Instrumentation Engineers. (2003). Zheng, Q., Kanungo, T.: Morphological Degradation Models and Their Use in Document Image Restoration. International Conference on Image Processing, Vol. 1, (2001) 193-196. Ozawa, H., Nakagawa, T.: A Character Image Enhancement Method from Characters With Various Background Images. International Conference on Document Analysis and Recognition. (1993) 58-61. Ali, M.B.: Background Noise Detection and Cleaning in Document Images. Proceedings of the 13th International Conference on Pattern Recognition, Vol. 3, (1996)758-762. Bovik, A.: Streaking in Median Filtered Images. IEEE Transactions on Acoustics, Speech, and Signal Processing, Vol. 35, No. 4, (1987) 493-503. Ko, S., Lee, Y.H.: Center Weighted Median Filters and Their Applications to Image Enhancement. IEEE Transactions on Circuits and Systems, Vol. 38, No. 9, (1991) 984-993. Barni, M., Buti, F., Bartolini, F., Cappellini, V.: A Quasi-Euclidean Norm to Speed Up Vector Median Filtering. IEEE Transactions on Image Processing, Vol. 9, No. 10, (2000) 1704-1709. Billawala, N., Hart, P.E., Peairs, M.: Image continuation. Proceedings of the Document Analysis and Recognition. (1993) 53-57. Abreu, E., Lightstone, M., Mitra, S., Arakawa, K.: A New Efficient Approach for the Removal of Impulse Noise from Highly Corrupted Images. IEEE Transactions on Image Processing, Vol.5, No. 6, (1996) 1012-1025. Manber, U.: Introduction to Algorithms: A Creative Approach. Addison-Wesley Longman Publishing. (1989).
Automated Seeded Region Growing Method for Document Image Binarization Based on Topographic Features* Yufei Sun¹,², Yan Chen¹,², Yuzhi Zhang¹, and Yanxia Li¹,²
¹ Institute of Computing Technology, Chinese Academy of Sciences, 100080 Beijing, China, {sunyf, chenyan}@ict.ac.cn, {[email protected], [email protected]}
² Graduate School of the Chinese Academy of Sciences, 100039 Beijing, China
Abstract. Binarization of document images with poor contrast, high noise and variable modalities remains a challenging problem. This paper proposes a new binarization method based on seeded region growing and the characters' topographic features. It consists of three steps: first, seed pixels are selected automatically according to their topographic features; second, regions are grown under the control of a new weighted priority until all pixels are labeled black or white; third, noisy regions are removed based on the average stroke width feature. Our method overcomes the difficulty of global binarization in finding a single threshold value that fits the whole image, and it avoids the common problem of most local thresholding techniques of finding a suitable window size. The proposed method performed well in binarization, and the experimental evaluation showed significant improvement compared to several other methods.
1 Introduction Binarization is an extremely important step in a document image analysis and recognition system in that it affects the performance of all subsequent analysis. Binarization techniques can typically be divided into two classes: global (static) thresholding and local (adaptive) thresholding. Global thresholding algorithms use a single threshold value for the entire image. For most of them, the global threshold is determined by optimizing certain statistical measures, such as between-class variance [1], entropy [2] and clustering criteria [3]. In [4], the performance of more than 20 global thresholding algorithms was evaluated using uniformity or shape measures. Clearly, a single threshold value cannot usually fit all conditions in the entire image. In contrast, local thresholding is more adaptive: a separate threshold is computed for each pixel based on its neighborhood, depending on local statistics such as a contrast measure [5] or a weighted running average [6]. Eleven different locally adaptive binarization methods were evaluated in [7], and Niblack's method [8] gave the best performance. * The research is supported by the National High Technology Research and Development Program of China (No. 2003AA1Z2230).
The most important parameter of the local thresholding methods is the window size, which is related to the size of the objects of interest in the image. If the window size is not set appropriately, binarization may be incorrect, and although some new techniques have been introduced, the selection of a proper window size remains difficult. Thresholding methods also neglect the spatial information of the image and do not cope well with noise or blurring at boundaries. Especially for images with high noise and poor contrast, improper thresholding can cause merges, fractures and other deformations in the character shapes, which are known to be the main reasons for the deterioration of Optical Character Recognition (OCR) performance. A new method for image binarization is proposed in this paper. This method does not use an explicit threshold; it searches for the print and background pixels directly by using an automated Seeded Region Growing (SRG) technique with higher-level knowledge constraints. It uses the characters' topographic features to select the seed pixels as well as to control the region growth, and uses the average stroke width feature in postprocessing. These features contain essential structural information, so the final binarization result preserves the useful character structure well.
2 Traditional SRG and Problems "Seeded Region Growing" [9] is a method for image segmentation. It begins with a set of marked seed pixels (or regions) which are grown until all the image pixels have been processed. This growth is controlled by a priority queue mechanism which depends on the similarity of pixels within regions. The objective of our binarization approach is to extract textual information from an image. To accomplish this, there are three important problems that the traditional SRG algorithm must overcome: 1) Automatic selection of the seed pixels. The traditional SRG method has a main disadvantage: the selection of seed pixels has to be done manually. For some applications, such as segmenting a medical image into a few regions, the user's participation in marking several seed pixels is possible and acceptable. However, for document image binarization this disadvantage becomes intolerable, so an effective automatic seed pixel selection technique must be developed. 2) Improved rules for region growing. In the traditional SRG implementation, the growth of seed regions is controlled by a priority queue mechanism whose priority measure is how far the intensity of the regarded pixel is from the intensity mean of its neighboring regions. This only ensures that the final segmentation is as homogeneous as possible given the connectivity constraint. For a variety of images, however, it is not tenable to assume that the gray level within character objects is homogeneous; the final shape of a region should be controlled by higher-level knowledge to guarantee the correctness and integrity of the result. 3) Robustness in the presence of noise and small valid regions. Chinese characters have complicated structures, and noise is unavoidable in real images. Thus, the problem arises of distinguishing small valid regions from noisy regions, and further corrections are required to obtain better results.
In this paper, the problems listed above are solved by adopting the characters' topographic features and the average stroke width feature. The details of our algorithm are discussed in the following section.
3 A New Binarization Method The proposed algorithm consists of three steps: (1) initialization, in which seed pixels are determined automatically and some easy-binarized pixels are labeled first; (2) region growing, in which the growth of regions is controlled by a new weighted priority; and (3) postprocessing, in which further corrections are made based on the average stroke width feature.
3.1 Initialization This step includes two phases: seed pixel selection and preprocessing of easy-binarized pixels. Details of each phase are given in the following subsections. Seed Pixel Selection. In the whole algorithm, the selection of seed pixels, by which the region growing is controlled, is critical. From observation and analysis of the image's topographic structure, it is found that pixels with local extreme features (peak or pit) fall on typical pixels of the foreground (or background) and can sketch preliminary structural information of the characters, so we choose them as seed pixels. A variety of techniques can be used to extract topographic features; Lee and Kim's method [10] is used in this paper because it is fast, simple and effective.
Fig. 1. An example of topographic features
Without loss of generality, we assume that text is darker than its background, since this is the case in the majority of printed text documents. For an input gray-scale image, G(P) denotes the gray value of a pixel P; the gray value range is typically [0, 255], and if it is not, it is assumed to be rescaled to extend from 0 to 255. First, the topographic features of the image are extracted. Each pixel is classified as one of peak, pit, ridge, ravine, saddle or hillside (Fig. 1) according to rules based on the gradient at adjacent pixels, the derivative of the gradient, and the principal orthogonal elements of the underlying image intensity surface.
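A much simplified way to obtain such labels is sketched below; it is only an approximation based on the gradient magnitude and the eigenvalues of the Hessian, not Lee and Kim's exact rules, and the tolerance values are placeholders.

```python
import numpy as np

def classify_topography(gray, grad_tol=1.0, curv_tol=1e-3):
    """Crude topographic labelling of a gray-scale image (placeholder thresholds)."""
    g = gray.astype(float)
    gy, gx = np.gradient(g)              # first derivatives
    gyy, gyx = np.gradient(gy)           # second derivatives
    gxy, gxx = np.gradient(gx)
    grad_mag = np.hypot(gx, gy)
    # Eigenvalues of the 2x2 Hessian [[gxx, gxy], [gyx, gyy]].
    tr, det = gxx + gyy, gxx * gyy - gxy * gyx
    disc = np.sqrt(np.maximum(tr * tr - 4 * det, 0))
    l1, l2 = (tr + disc) / 2, (tr - disc) / 2
    labels = np.full(g.shape, 'hillside', dtype='<U8')
    flat = grad_mag < grad_tol
    labels[flat & (l1 < -curv_tol) & (l2 < -curv_tol)] = 'peak'
    labels[flat & (l1 > curv_tol) & (l2 > curv_tol)] = 'pit'
    labels[flat & (l1 > curv_tol) & (l2 < -curv_tol)] = 'saddle'
    labels[~flat & (l2 < -curv_tol)] = 'ridge'     # strong negative curvature on a slope
    labels[~flat & (l1 > curv_tol)] = 'ravine'     # strong positive curvature on a slope
    return labels
```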
Depending upon the relationships between topographic features and character structure, the following rules are used for seed pixel selection: if P is a peak and G(P) is less than a threshold T_b, then P is a seed pixel and is labeled black (foreground); if P is a pit and G(P) is greater than a threshold T_w, then P is a seed pixel and is labeled white (background). T_b and T_w are the thresholds used to guarantee the accuracy of seed pixel selection and to suppress the disturbance of noise. A basic threshold T acts as a baseline; it is obtained using Otsu's statistical method [1], which seeks the cut value that maximizes the variance between the two resulting classes. Because the gray level distributions of objects and background commonly overlap in real images, one of these thresholds is set a little greater than T while the other is set just less than T in our work. Determination of Easy-Binarized Pixels. There exist quite a few comparatively darker (or lighter) pixels which can easily be assigned to the foreground (or background). A preprocessing step therefore labels these easy-binarized pixels first, in order to reduce unnecessary computation. Two parameters LT (Low Threshold) and HT (High Threshold) are set in order to define comparatively darker or lighter pixels; both are functions of the basic threshold T and an empirical constant r.
The value of r is an empirical constant determined through a number of trials. Pixels whose gray value is less than or equal to LT are considered darker and are directly labeled black, while those greater than or equal to HT are considered lighter and are directly labeled white. The remaining pixels are dealt with in the next step.
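A rough sketch of this initialization step is given below. It is illustrative only: it reuses labels such as those produced by the simplified classifier sketched earlier, and since the exact expressions for T_b, T_w, LT and HT are not reproduced in this text, the offsets and the value of r below are placeholder assumptions.

```python
import numpy as np
from skimage.filters import threshold_otsu

BLACK, WHITE, UNKNOWN = 0, 1, -1

def initialize(gray, topo, r=0.1, delta=10):
    """Label seed pixels and easy-binarized pixels.

    gray  : 2-D uint8 array of gray values.
    topo  : 2-D array of topographic labels ('peak', 'pit', ...), assumed given.
    r, delta : placeholder constants; the paper's actual formulas differ.
    """
    T = threshold_otsu(gray)                     # basic baseline threshold
    LT, HT = T * (1 - r), T * (1 + r)            # assumed forms of LT and HT
    labels = np.full(gray.shape, UNKNOWN, dtype=int)

    # Seed selection from local extrema of the intensity surface.
    black_seed = (topo == 'peak') & (gray < T - delta)   # dark peaks: foreground seeds
    white_seed = (topo == 'pit') & (gray > T + delta)    # bright pits: background seeds
    labels[black_seed], labels[white_seed] = BLACK, WHITE
    seeds = black_seed | white_seed

    # Easy-binarized pixels: clearly darker or lighter than the baseline.
    labels[(labels == UNKNOWN) & (gray <= LT)] = BLACK
    labels[(labels == UNKNOWN) & (gray >= HT)] = WHITE
    return labels, seeds
```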
3.2 Region Growing The novel aspect of this step is that topographic features are taken into account in the priority measure of the region growing rule: different topographic features receive different weight factors. This higher-level knowledge constraint controls the accuracy of the region growth and preserves the character structure as much as possible. The process of region growing is described below. After the initialization, part of the pixels have been labeled or selected as seed pixels. Eight-connected pixels with the same label are grouped into sets which become the initial regions. Each region carries the following attributes: label, center point, area, seed pixels, and intensity mean value; a region may contain more than one seed pixel. Let S be the set of unallocated pixels which border at least one of the regions. Pixels in S are queued by a new priority. Instead of the original simple distance definition, different weight factors are added to the new measure
according to the corresponding topographic features. The new priority measure δ(x, R_i) between the regarded pixel x and a neighboring region R_i is defined in terms of three quantities:
the distance between the gray value of x and the mean gray value of R_i, the number of seed pixels n_i that region R_i contains, and a weight factor W(T(x), l_i) that depends on the topographic feature T(x) of the pixel and the label l_i of region R_i. The smaller δ(x, R_i) is, the higher the priority of x. In the topographic structure of an image, peak, ridge and saddle points tend to belong to the foreground, pit and ravine points tend to belong to the background, and hillside points are indifferent. Thus, for a region labeled black, the weight factors satisfy W(peak, black) < W(ridge, black) < W(saddle, black) < W(hillside, black) < W(ravine, black) < W(pit, black); for a white region, the order is reversed. Some regions are large and contain more than one seed pixel, and each seed pixel has a gray value which is typical of its small seed area. In order not to lose this typicality, when a neighboring region contains more than one seed pixel, the mean value of the whole region is replaced by the mean gray value of the pixels which belong both to NSA(x, R_i) and to R_i, where NSA(x, R_i) denotes the rectangular area whose diagonal corners are the pixel x and the seed pixel of R_i nearest to x. Figure 2 illustrates this case.
Fig. 2. The case of a neighboring region with more than one seed pixel. The regarded pixel is x, and its neighboring region is the shaded region, which contains three seed pixels marked with black crosses. The rectangular area NSA between x and the nearest seed pixel is shown with a dashed line. The mean value is computed over the four shaded pixels inside the NSA area.
Besides, in this algorithm there will be many small seed regions, so it is very important for the final binarization result to overcome order dependencies, as pointed out in [11]. Inherent order dependencies are overcome by parallel processing and a secondary measure, while implementation order dependencies are overcome by updating the priority when necessary. When the pixel x has more than one neighboring region, the minimal δ among the neighboring regions is chosen as its priority. In the particular case that the pixel x has
the same priority for several regions it borders, a secondary measure, such as the region's area, is used to break the tie. Region growing is an iterative process. At each step we take every pixel with the least value (the highest priority) from S, add it to its corresponding neighboring region R_i and assign it the same label as R_i. Then the attributes of the regions to which pixels were added are updated. Finally, the new adjacent pixels are added to S and the priorities affected by the changed regions are recalculated. The pseudocode of the seeded region growing process is given below:
  Form the initial seed regions according to the initialization.
  Put the neighbors of the regions into S, recording their priority and neighboring region.
  While S is not empty do {
    Remove each pixel x with the highest priority from S in parallel. {
      Add x to its corresponding neighboring region R_i.
      Assign x the same label as R_i.
    }
    Unite connected regions with the same label into one.
    Update the attributes of the changed regions.
    Put the new neighbors of the changed regions into S.
    Recalculate the priorities from the changed region mean values.
  }
After the process ends and all the pixels have been allocated, a set of regions labeled black or white is formed, denoted as set R.
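A simplified, sequential Python sketch of this growing loop is given below. It is not the authors' implementation: it replaces the parallel removal with a binary heap, uses the intensity difference to the already-labelled neighbour (rather than to the region mean) scaled by a topographic weight as the priority, and the weight tables are hypothetical.

```python
import heapq
import numpy as np

# Hypothetical weight tables: a smaller weight means the pixel is more likely
# to join a region with that label (order follows the inequalities above).
W_BLACK = {'peak': 0.2, 'ridge': 0.4, 'saddle': 0.6, 'hillside': 0.8, 'ravine': 1.0, 'pit': 1.2}
W_WHITE = {'pit': 0.2, 'ravine': 0.4, 'hillside': 0.6, 'saddle': 0.8, 'ridge': 1.0, 'peak': 1.2}

def grow_regions(gray, labels, topo):
    """Grow BLACK(0)/WHITE(1) regions from labelled seeds over UNKNOWN(-1) pixels."""
    h, w = gray.shape
    out = labels.copy()
    heap = []

    def push_neighbours(y, x):
        for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            ny, nx = y + dy, x + dx
            if 0 <= ny < h and 0 <= nx < w and out[ny, nx] == -1:
                lab = out[y, x]
                weights = W_BLACK if lab == 0 else W_WHITE
                prio = abs(float(gray[ny, nx]) - float(gray[y, x])) * weights[str(topo[ny, nx])]
                heapq.heappush(heap, (prio, ny, nx, lab))

    for y, x in zip(*np.nonzero(out != -1)):      # neighbours of the initial regions
        push_neighbours(y, x)
    while heap:
        _, y, x, lab = heapq.heappop(heap)
        if out[y, x] != -1:
            continue                               # already allocated with higher priority
        out[y, x] = lab
        push_neighbours(y, x)
    return out
```

The output of initialize() above can be fed directly into grow_regions(); what remains afterwards is the postprocessing step described next.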
3.3 Postprocess The approach described above is reasonably simple, but a few exceptional cases need to be dealt with in order to obtain a practical algorithm. These exceptions are caused by noise and by possible mistakes in seed pixel selection, which lead to noisy regions and unwanted white spots inside strokes. In addition, document images contain some small isolated strokes and small background areas which resemble the noise and white spots. Therefore, some corrections are needed in order to erase the noise while preserving the valid structural information.
Fig. 3. (a) The four natural directions H, V, L, R. (b) The eight subdirections.
The average stroke width (ASW) feature is very useful for these corrections and can be evaluated from the run-length histograms of a binarized image. A run-length histogram is a one-dimensional array indexed by run length i ∈ {1, 2, ..., ML}, computed for each direction K, where
ML is the longest run to be counted and K denotes one of the four natural directions defined on the image lattice, namely the horizontal direction (H), the vertical direction (V), the right-diagonal direction (R), and the left-diagonal direction (L) (Fig. 3a). The ASW feature is defined as the black run length with the highest frequency in the run-length histograms of the H and V directions, excluding the unit run length. One problem is how to distinguish black noise from a small isolated stroke. Although both are small isolated black regions, they differ: a Chinese character's structure is compact, so a small isolated stroke is close to other strokes, whereas noise is comparatively far from the strokes or its area is very tiny. For each region R_i labeled black from the set R, the judging rules are: if the area of R_i is much smaller than a size threshold derived from the ASW, it is considered a noise region and is changed to the white label; if the area of R_i is smaller than a larger threshold, the first white run lengths in the eight subdirections (Fig. 3b) starting from the center point of R_i are calculated, and if the shortest of these run lengths is much longer than the ASW, R_i is a noise region and is changed to the white label. The other problem is how to distinguish a small background area from a white spot. As in the above situation, both are small isolated white regions. The strokes surrounding a small background area are real strokes whose width is close to the ASW, while the strokes surrounding a white spot are incomplete and their width differs from the ASW. For each region R_i labeled white from the set R, the judging rule is: if the area of R_i is smaller than the threshold, the first black run lengths in the eight subdirections starting from the center point of R_i are calculated, and if there exists a run length that is much shorter than the ASW, R_i is considered a white spot inside a stroke and is changed to the black label.
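The run-length statistics used here are simple to compute. The following illustrative sketch (not the authors' code) estimates the ASW from the horizontal and vertical black run-length histograms of a 0/1 image in which black is 0:

```python
import numpy as np
from collections import Counter

def black_runs(line):
    """Lengths of consecutive runs of black (0) pixels in a 1-D array."""
    runs, count = [], 0
    for v in line:
        if v == 0:
            count += 1
        elif count:
            runs.append(count)
            count = 0
    if count:
        runs.append(count)
    return runs

def average_stroke_width(img):
    """ASW: most frequent black run length in the H and V directions,
    ignoring unit-length runs."""
    hist = Counter()
    for row in img:                  # horizontal direction (H)
        hist.update(black_runs(row))
    for col in img.T:                # vertical direction (V)
        hist.update(black_runs(col))
    hist.pop(1, None)                # exclude the unit run length
    return max(hist, key=hist.get) if hist else 1
```

The first white (or black) run lengths in the eight subdirections used by the judging rules can be obtained in the same way, by walking from a region's centre point along each direction until the colour changes.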
4 Experimental Results To evaluate the proposed method, hundreds of document images from newspapers spanning a wide range of years were collected. The proposed method yielded satisfactory binarization quality for these images compared with the other methods: Otsu's method [1], Kapur's entropy method [2] and Niblack's method [8], which are representatives of global and local thresholding methods. Some experimental results are presented below. A portion of a source image with varying contrast is shown in Figure 4a. In this case, the gray value of some background regions falls below the level of some strokes, so no global threshold can yield a satisfactory binary result, irrespective of the value assumed. For low threshold values the resulting image contains broken characters, as with Kapur's entropy method (Fig. 4b); as the threshold is raised, smeared characters emerge, as with Otsu's method (Fig. 4c). The proposed method, however, performs well in preserving the weak strokes as well as the small background regions (Fig. 4d).
Fig. 4. Comparison of the proposed method with global threshold methods
As expected, Niblack's method fails when the window size is set improperly, like most other local thresholding methods. Figure 5b presents an example in which white spots emerge in large characters when a small window size is used. The proposed method does not have this problem and gives a satisfactory result (Fig. 5c).
Fig. 5. Comparison of the proposed method with local threshold method
Figure 6 shows the ability of the proposed method to process noisy, poor-contrast images. Thresholding techniques do not work properly in this situation (Fig. 6b and Fig. 6c), whereas the proposed method extracts the useful structural information successfully and achieves a better result than the others.
Fig. 6. Comparison of the proposed method with threshold methods
5 Conclusion The binarization of document images with variable modalities is still one of the most challenging topics in the field of document image analysis. In this paper, a new binarization method, an extension and modification of the traditional SRG algorithm, was presented. It searches for the print and background pixels directly and avoids the common problems of thresholding techniques. The contribution of this paper is to apply higher-level knowledge, including the topographic features and the average stroke width feature, throughout the binarization process. Experimental results demonstrate that our method is more effective than the other binarization methods.
References
1. Otsu, N.: A Threshold Selection Method from Gray-scaled Histogram. IEEE Trans. Systems, Man and Cybernetics, Vol. 8 (1978) 62-66
2. Kapur, J.N., Sahoo, P.K., Wong, A.K.C.: A New Method for Gray-level Picture Thresholding Using the Entropy of the Histogram. Computer Vision, Graphics and Image Processing, Vol. 29 (1985) 273-285
3. Kittler, J., Illingworth, J.: On Threshold Selection Using Clustering Criteria. IEEE Trans. Systems, Man and Cybernetics, Vol. 15 (1985) 652-655
4. Sahoo, P.K., Soltani, S., Wong, A.K.C.: A Survey of Thresholding Techniques. Computer Vision, Graphics and Image Processing, Vol. 41 (1988) 233-260
5. Giuliano, E., Paitra, O., Stringa, L.: Electronic Character Reading System. U.S. Patent 4,047,15 (1977)
6. White, J.M., Rohrer, G.D.: Image Thresholding for Character Image Extraction and Other Applications Requiring Character Image Extraction. IBM J. Research and Development, Vol. 27 (1983) 400-411
7. Trier, O.D., Jain, A.K.: Goal-Directed Evaluation of Binarization Methods. IEEE Trans. Pattern Analysis and Machine Intelligence, Vol. 17 (1995) 1191-1201
8. Niblack, W.: An Introduction to Digital Image Processing. Englewood Cliffs, Prentice Hall (1986) 115-116
9. Rolf, A., Leanne, B.: Seeded Region Growing. IEEE Trans. on Pattern Analysis and Machine Intelligence, Vol. 16, No. 6 (1994) 641-647
10. Lee, S.W., Kim, Y.J.: Direct Extraction of Topographic Features for Gray Scale Character Recognition. IEEE Trans. on Pattern Analysis and Machine Intelligence, Vol. 17, No. 7 (1995) 724-729
11. Andrew, M., Paul, J.: An Improved Seeded Region Growing Algorithm. Pattern Recognition Letters, Vol. 18 (1997) 1065-1071
Image Segmentation of Historical Documents: Using a Quality Index Carlos A.B. de Mello Escola Politécnica de Pernambuco, Universidade de Pernambuco Rua Benfica, 455, Madalena, Recife, PE, Brazil
[email protected]
Abstract. A new entropy-based segmentation algorithm for images of historical documents is presented herein. The algorithm provides high quality images and also improves OCR (Optical Character Recognition) responses for typed documents. It adapts its settings to achieve better-quality images through changes in the logarithmic base that defines the entropy. For this purpose, a measure of image fidelity is applied, along with information inherent to images of documents.
1 Introduction Thresholding, grey-level segmentation or binarization is the conversion of a grey-level image into a bi-level one. This is the first step in several image processing applications. It can also be understood as a classification of the pixels of an image into objects and background: it does not identify objects, it just separates them from the background. This separation is not easily done in images with low contrast; in such cases, image enhancement techniques must first be used to improve the visual appearance of the image. Another major problem is the definition of the features that are analyzed in the search for the correct threshold value which classifies a pixel as object or background. In the final bi-level image, pixels with a gray level of 0 (black) indicate an object (the signal) and a gray level of 1 (white) indicates the background. Image segmentation techniques can be found in document processing applications, where segmentation can be applied in different contexts. In particular, it can be used as a first step for character recognition when the system must separate text and graphical elements. However, depending on the kind of document under study, a previous phase may be needed to extract the background of the document, improving the execution of character recognition algorithms. In document images, the background can be seen as the paper of the document and the object is the ink. When the image comes from historical documents this problem is quite particular: the paper presents several types of noise, so ink and paper segmentation is not always a simple task. In some documents the ink has faded; others were written on both sides of the paper, presenting ink-bleeding interference.
conversion into a bi-level image of this kind of document using a nearest color threshold algorithm does not achieve high quality results. In this work, we analyze the application of the segmentation process to generate high quality bi-level images from grey-scale images of documents. The images are from the bequest of letters, documents and post cards of Joaquim Nabuco¹ held by the Joaquim Nabuco Foundation (a social science research centre in Recife, Brazil). The Image Processing of Historical Documents Project (DocHist) aims at the preservation of, and easy access to, the contents of a file of thousands of historical documents. For this purpose an environment is under development to acquire, process, compress, store and retrieve the information of Joaquim Nabuco's files. In Nabuco's bequest, there are documents written on one side or on both sides of the sheet of paper. In the latter case, two classes are identified: documents without back-to-front interference and documents with back-to-front interference. The first class is the most common and offers little difficulty to reduce the color palette suitably. The bi-level image can be generated from the colored one through the application of a threshold filter. A neighborhood filter [1] can also be used to reduce the "salt-and-pepper" noise in the image. Palette reduction of documents with ink-bleeding interference is far more difficult to address. A straightforward threshold algorithm does not eliminate all the influence of the ink transposition from one side to the other in all cases.
2 Materials and Methods
This research takes place within the scope of the DocHist Project for the preservation and broadcasting of a file of thousands of historical documents. These documents are from Joaquim Nabuco's [2] bequest and are composed of more than 6,500 letters, documents and post cards, which amount to more than 30,000 pages. The file dates from the end of the nineteenth century onwards. To preserve the file, the documents are digitized at 200 dpi resolution in true color and stored in JPEG file format with 1% loss for a better quality/storage-space ratio. Even in this format each image of a document reaches, on average, 400 KB. Although broadband Internet access is common practice nowadays, the visualization of a bequest of thousands of files is not a simple task, and even in JPEG file format the whole bequest consumes gigabytes of space. A possible solution to this problem is to convert the images to bi-level, which is not a simple task either. Some documents are written on both sides of the paper, creating back-to-front interference; in others the ink has faded. Thus, binarization by commercial software with standard settings is not appropriate. Figure 1 presents a sample document and its bi-level version produced by a straightforward threshold algorithm. Dithering algorithms achieve images of better visual quality, but they are not adequate for OCR purposes.
¹ Brazilian statesman, writer, and diplomat, one of the key figures in the campaign for freeing black slaves in Brazil, Brazilian ambassador to London (b. 1861 - d. 1910)
Fig. 1. (left) Grayscale sample document written on both sides of the paper and (right) its bi-level version produced by a threshold algorithm.
Besides better compression rates, high quality bi-level images yield better responses from OCR tools. This allows the use of text files to make the contents of the documents available instead of their full digitized images. The problem remains in the generation of these bi-level images from the true color original ones. For this, an entropy-based segmentation algorithm is proposed and extended with the use of an image quality index.
2.1 New Entropy-Based Segmentation Algorithm
There are several segmentation algorithms in the literature, based on different features of the images. In particular, we focus herein on entropy-based algorithms [3], as entropy is the basis of our new proposal. In this class, three algorithms are summarized: Pun [4], Kapur et al. [5] and Johannsen and Bille [6]. The proposed algorithm scans the image in search of the most frequent color, t. As we are working with images of letters and documents, it is reasonable to suppose that this color belongs to the paper. This color is used as an initial threshold value for the evaluation of Hb and Hw, as defined in Eqs. 1 and 2:
Hb = - Σ_{i=0..t} p[i] log(p[i])    (1)
Hw = - Σ_{i=t+1..255} p[i] log(p[i])    (2)
where p[i] is the probability of gray level i. As noted in [7], the use of different logarithmic bases does not change the concept of entropy. For now, this base is taken as the area of the image: width times height. With Hw and Hb, the entropy H of the image is evaluated as their sum: H = Hw + Hb.
Based on the value of H, three classes of documents were identified, which define two multiplicative factors, as follows: H ≤ 0.25 (documents with few parts of text or very faded ink), then mw = 2 and mb = 3; 0.25 < H < 0.30 (the most common case), then mw = 1 and mb = 2.6; H ≥ 0.30 (documents with many black areas), then mw = mb = 1. These values of mw and mb were found empirically after several experiments in which the hit rate of OCR tools on typed documents defined the correct values. With the values of Hw, Hb, mw and mb, the threshold value th is defined as th = mw·Hw + mb·Hb. The grayscale image is then scanned again, and each pixel i whose gray level graylevel[i] exceeds the threshold derived from th is turned to white; this test is called the segmentation condition. Otherwise, its color either remains the same (generating a new grayscale image with a white background, which is used for the quality analysis described next) or it is turned to black (generating a bi-level image). This new segmentation algorithm was used for two kinds of application: 1) to create high quality bi-level images of the documents for minimum storage space and efficient network transmission, and 2) to achieve better hit rates with commercial OCR tools. The algorithm was tested on a set of more than 500 sample documents representative of the complete bequest of Joaquim Nabuco. Despite its good overall performance, some special cases were identified in which, as defined, the algorithm is inefficient. In some documents, a change of the logarithmic base produced a perceptually better quality image; the base can instead be taken as the number of possible grey levels in the image, i.e., 256. However, there are harder cases where even this change does not work well. In these cases, modifications to the constants mw and mb are necessary: they must be set to one third of their previous values. This solves the quality problem, but one question remains: as the system is supposed to work without a specialized user, how can it decide when to make these changes? In other words, how can it know that the segmented image does not have the necessary quality? For these cases, image quality measures are applied and adjusted to the images under study.
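A minimal Python sketch of this thresholding procedure is given below. The partial entropies follow Eqs. (1) and (2); the scaling of th to the gray-level range in the segmentation condition is an assumption made for illustration, not necessarily the authors' exact rule.

```python
import numpy as np

def entropy_segment(gray, log_base=None):
    """Sketch of the entropy-based thresholding described above.

    gray: uint8 grayscale image.  The segmentation condition coded here
    (pixels lighter than 256*th become white) is an assumption.
    """
    height, width = gray.shape
    if log_base is None:
        log_base = height * width            # area of the image
    hist = np.bincount(gray.ravel(), minlength=256).astype(float)
    p = hist / hist.sum()                    # p[i]: probability of gray level i
    t = int(np.argmax(hist))                 # most frequent color (the paper)

    def partial_entropy(probs):
        probs = probs[probs > 0]
        return float(-(probs * np.log(probs) / np.log(log_base)).sum())

    Hb = partial_entropy(p[:t + 1])          # Eq. (1)
    Hw = partial_entropy(p[t + 1:])          # Eq. (2)
    H = Hb + Hw

    if H <= 0.25:                            # few text parts or very faded ink
        mw, mb = 2.0, 3.0
    elif H < 0.30:                           # the most common case
        mw, mb = 1.0, 2.6
    else:                                    # many black areas
        mw, mb = 1.0, 1.0

    th = mw * Hw + mb * Hb
    white = gray >= 256 * th                 # assumed segmentation condition
    return np.where(white, 255, 0).astype(np.uint8), th, H
```

Calling this once with log_base left as the image area and once with log_base=256 produces the two candidate images that are later compared by the fidelity index of Sect. 2.2.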
2.2 Image Quality Analysis
The definition of image quality metrics has been the subject of several studies by image processing researchers. These metrics range from subjective measures, such as the Mean Opinion Score (MOS), to objective ones, such as the Peak Signal-to-Noise Ratio (PSNR), the Mean Square Error (MSE) and the Analysis of Variance (ANOVA).
To determine the MOS, a number of subjects rate the quality of the images as follows: 1 for bad, 2 for poor, 3 for fair, 4 for good and 5 for excellent. The MOS is the arithmetic mean of all the individual scores, and can range from 1 (the worst) to 5 (the best). The PSNR is a pixel-to-pixel evaluation which does not consider the complete structure of an image, as it is based on the MSE. Statistical measures can be a good choice for this problem; in this case, the ANOVA is appropriate. Here, the main question is the definition of the measures to be analyzed between the compared images and then fed into the analyzer. In [8], Haralick defines a set of fourteen features that must be evaluated when textures are under study, including Entropy, Contrast, Correlation and Variance, among others. In the same way, [9] proposes the evaluation of some of these measures using the Gray-Level Co-occurrence Matrix (GLCM). [10] defines "image quality in terms of the satisfaction of two requirements: usefulness (i.e. discriminability of image content) and naturalness (identifiability of image content)". The main problem with these metrics is that they perform an analysis between two images (a reference image and a target one). As we are dealing with historical documents, there is no original image to compare with. In this sense, our case is what is defined in [11] as "blind image quality assessment". [12] proposes a method for the restoration of historical documents and evaluates it based on the concepts of precision and recall, as defined in [13], and accuracy, as defined in [14] and [15]. These measures, however, require a user to visually analyze the images, which is not appropriate for our application. For typed documents, one can define the quality of the segmented image based on the hit rate of OCR tools. In our case, OmniPage is the software used, as it was analyzed and compared to other similar software in [16]. An index for image quality measurement, Q, is defined in [17] in terms of the linear correlation coefficient and the similarities between the means and variances of two images. In fact, based on [13], Z. Wang et al. defined in [17] an index of fidelity, as it is an evaluation of the similarities between two images; Q is equal to 1 if the images are equal. This index is used in the solution of part of our problem. As explained before, different images are obtained with different choices of the logarithmic base, as it generates different threshold values. Two images are then generated with the two possible values of this base (width by height, or 256) and the index Q is evaluated for both of them. These images are generated as greyscale images with white backgrounds, as described before. The image which yields the value more distant from 1 is selected, as the other one is closer to the original image. This chooses between the two bases; in fact, this fidelity index can thus be used to guarantee the quality of the image. After this, all non-white pixels of the selected image are converted to black. For the cases where this change of base is not sufficient, when the values of mw and mb must be changed, the solution cannot come from a fidelity index, as both images generated by the two logarithmic bases are inadequate. As the algorithm is applied only to images of documents, it is reasonable to suppose that most of the image is composed of pixels that represent the paper, not the ink.
The correction of the constants will be applied if the segmented image has more than 50% of pixels classified as ink (black pixels in a bi-level image).
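To make the base-selection rule concrete, the sketch below computes the universal image quality index Q of Wang and Bovik between the original grayscale image and each white-background segmentation and keeps the one whose Q is farther from 1. Computing Q with global statistics over the whole image, rather than over sliding windows as in [17], is a simplification made here for brevity.

```python
import numpy as np

def quality_index(x, y):
    """Universal quality index Q of Wang and Bovik, computed globally
    (the windowed evaluation of [17] is simplified to one global window)."""
    x = x.astype(np.float64).ravel()
    y = y.astype(np.float64).ravel()
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return 4.0 * cov * mx * my / ((vx + vy) * (mx * mx + my * my))

def choose_segmentation(original_gray, seg_area_base, seg_256_base):
    """Keep the white-background segmentation whose Q is farther from 1,
    following the selection rule described above."""
    q1 = quality_index(original_gray, seg_area_base)   # base = width * height
    q2 = quality_index(original_gray, seg_256_base)    # base = 256
    return seg_area_base if abs(q1 - 1.0) > abs(q2 - 1.0) else seg_256_base
```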
3 Results
The entropy-based algorithms analyzed were applied to the sample document presented in Fig. 1 (left). The generated images are presented in Fig. 2. The new proposed algorithm was also applied to the same sample document; its final image can be seen in Fig. 2 (right). In this case, it was applied with a logarithmic base of width by height, achieving a high quality image. Fig. 3 presents an example of a document where this base is not the best one. In this case, the image quality measure was evaluated to choose the best logarithmic base. The values of Q for the images of Fig. 3 (center) and (right) are Q1 = 0.32 and Q2 = 0.22, where Q1 is the fidelity index evaluated between the original image (Fig. 3, left) and the image segmented with the logarithmic base equal to the product of the dimensions of the image (Fig. 3, center), and Q2 is the fidelity index for the image segmented with a logarithmic base of 256, presented in Fig. 3 (right). The base change modifies the threshold value from 79 to 144. In the case presented in Fig. 4, the images segmented with both logarithmic bases presented more than 70% of black pixels. This indicates that a correction of the multiplicative factors mw and mb must be done. The image segmented after this modification can be seen in Fig. 4 (bottom-right).
Fig. 2. Application of entropy-based segmentation algorithms to the sample document of Fig. 1: (left) Pun algorithm; (center-left) Kapur et al. algorithm; (center-right) Johannsen and Bille algorithm and (right) new proposed algorithm.
Fig. 3. (left) Sample document, (center) the bi-level image segmented with a logarithmic base of width by height, which is not satisfactory, and (right) a change of the logarithmic base to 256, which defines a new threshold value and creates a less noisy image.
Fig. 4. (top-left) Original document, (top-right) segmented version with the logarithmic base equal to width by height, (bottom-left) segmented image for a base equal to 256 and (bottom-right) application of the algorithm to the original document (in grayscale) with changed values of mw and mb.
4 Conclusions
This paper presents a new entropy-based segmentation algorithm for images of historical documents. Beyond using the entropy of the original image to define the threshold value, the algorithm proposes changes in the logarithmic base to achieve better quality images. Following the definition of entropy, two bases are proposed: 256 (corresponding to the maximum number of possible gray levels) and the area of the image (corresponding to the total number of pixels in the image). The correct base is chosen using a fidelity index between each of the segmented images and the original grayscale one. With the proposed variations and the use of a fidelity index, the segmented images can be evaluated by quantitative means. The criteria for quality evaluation of images of historical documents were also analyzed, as there is no original image to use and the reference one is aged.
Acknowledgments. This research is partially sponsored by CNPq (55.0017/2003-8), FACEPE (Bolsa de Incentivo Tecnológico) and UPE.
References
1. J.R. Parker: Algorithms for Image Processing and Computer Vision. J. Wiley & Sons (1997)
2. R.D. Lins, M.S. Guimarães, L.R. França, L.G. Rosa: An Environment for Processing Images of Historical Documents. Microprocessing & Microprogramming, North-Holland (1995)
3. C. Shannon: A Mathematical Theory of Communication. Bell System Technology Journal, Vol. 27 (1948) 370–423, 623–656
4. T. Pun: Entropic Thresholding, A New Approach. Computer Graphics and Image Processing (1981)
5. J.N. Kapur, P.K. Sahoo, A.K.C. Wong: A New Method for Gray-Level Picture Thresholding Using the Entropy of the Histogram. Computer Vision, Graphics and Image Processing, 29(3) (1985)
6. G. Johannsen, J. Bille: A Threshold Selection Method Using Information Measures. In: Proceedings, 6th Int. Conf. on Pattern Recognition, Munich, Germany (1982) 140–143
7. J.N. Kapur: Measures of Information and their Applications. J. Wiley & Sons (1994)
8. R. Haralick, K. Shanmugam, I. Dinstein: Textural Features for Image Classification. IEEE Trans. on Systems, Man and Cybernetics, November (1973)
9. K. Franke, O. Bunnemeyer, T. Sy: Ink Texture Analysis for Writer Identification. In: 8th Int. Workshop on Frontiers in Handwriting Recognition, Ontario, Canada (2002)
10. T.J.W.M. Janssen: Understanding Image Quality. In: Int. Conf. on Image Processing (2001)
11. Xin Li: Blind Image Quality Assessment. In: IEEE Int. Conf. on Image Processing (ICIP '02), New York, USA (2002)
12. C.L. Tan, R. Cao, P. Shen: Restoration of Archival Documents Using a Wavelet Technique. IEEE Trans. on Pattern Analysis and Machine Intelligence (2002) 1399–1404
13. G. Salton: Automatic Text Processing: The Transformation, Analysis and Retrieval of Information by Computer. Addison-Wesley (1994)
14. D. Michie, D. Spiegelhalter, C. Taylor: Machine Learning, Neural and Statistical Classification. Ellis Horwood (1994)
15. M. Junker, R. Hoch, A. Dengel: On the Evaluation of Document Analysis Components by Recall, Precision, and Accuracy. In: Int. Conf. on Document Analysis and Recognition (1999)
16. C.A.B. Mello, R.D. Lins: A Comparative Study on Commercial OCR Tools. In: Vision Interface '99, Canada (1999)
17. Z. Wang, A.C. Bovik, Ligang Lu: Why Is Image Quality Assessment So Difficult? In: IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP '02), Florida, USA (2002)
A Complete System for Detection and Identification of Tabular Structures from Document Images
S. Mandal¹, S.P. Chowdhury¹, A.K. Das¹, and Bhabatosh Chanda²
¹ CST Department, B. E. College, Howrah 711 103, INDIA
{sekhar,shyama,amit}@cs.becs.ac.in
² ECSU, Indian Statistical Institute, Kolkata 700 035, INDIA
[email protected]
Abstract. The detection and identification of tabular structures in a document image is crucial to any DIA and digital library system. In this paper we report a generic approach to detect, without any OCR, any tabular structure that may be present in the document, whether as a table, tabular displayed math, table of contents page or index page.
1 Introduction
With the maturity of document image analysis and document image understanding, many practical systems are coming up to manage paper documents in electronic form so as to facilitate indexing, viewing, printing and extracting the intended portions. Such systems include digital document libraries, vectorization of engineering drawings and form processing systems [1,2,3,4], to name a few. The common task for a typical document image analysis (DIA) system starts with skew correction and identification of the constituent parts of the document image as text, graphics, half-tones, etc. The graphics portion may be vectorized and the text portion may be put through OCR. However, any tabular structure that may be present in the document, e.g., tables, some of the displayed math-zones and special cases like table of contents (TOC) and index pages, has to be treated separately. Direct OCR will simply fail in the case of math-zones, while tables, TOC and index pages need special treatment as the fields are interrelated and individually carry little meaning. We contend that a unified approach to separating out any tabular portion and declaring it as an index or TOC page, regular table or tabular math-zone would make the segmentation task easier and a DIA system more reliable and robust, irrespective of the variety in its input. It could also be important for document image libraries, where it helps data mining and aggregation [5].
2 Past Work
Table detection and segmentation has been done by many researchers [6,7,8,9]. Watanabe et al. [7] have proposed a tree structure to capture the structures of various
kinds of tables. Table structure detection is also reported in [6]. Zuyev [10] described a table grid and defined the compound and simple cells of a table based on the table grid. A node property matrix is used by Tanaka [11] in the processing of irregular rule lines and the generation of HTML files. Unknown table structure analysis is proposed by Belaid [12]. Tersteegen et al. proposed a system for the extraction of tabular structures with the help of a predefined reference table [13], and Tsuruoka [9] proposed a segmentation method for complex tables including rule lines and omitted rule lines. In [8] a technique is described to separate out tables and headings present in document images. With the maturity of document image analysis and document image understanding, digital document libraries are coming up [3,4]. As a result, the identification/segmentation of the TOC and index pages from scanned documents has attracted researchers [14,15], who have put forward a couple of schemes to do the same. It has been observed from the existing literature that most of these works are directed toward higher level understanding of the TOC pages, so as to extract the structural information and represent the whole in a suitable meta-structure like HTML or XML [14]. Many researchers in recent years have developed a number of approaches [16,17,18,19] for mathematical document processing, with a focus on mathematical formula recognition and understanding. They assume that the math-zones are already segmented from the document image, manually or through semi- or fully automated segmentation logic. Fateman et al. [18] propose a scheme which utilises character size and font information, etc., to identify all connected components. Math segmentation is done in [20] through physical and logical segmentation using spatial characteristics of the math zone as well as by identifying some math symbols. No OCR has been used in [21] to segment both displayed-math and in-line-math portions.
3 Proposed Work
The objective of the present work is to find tabular structures of any kind through simple checks on some of the structural properties of the document images, thereby avoiding the costly solutions required to detect and identify tables, TOC and index pages, and tabular math-zones individually, as discussed in Section 2. This work is a continuation of our earlier work on segmentation, where the document image contains text, graphics, half-tones, math-zones, tables and headings. Tabular structure detection starts with the gray scale image of the page. The half-tones are removed first [22]. The image is then binarised [23] and skew corrected [24]. This facilitates segmentation of the tabular structures from the residue image. We start with the descriptions of the tabular structures in the form of i) tables, ii) TOC pages, iii) index pages and iv) tabular math-zones (occupying part of a page or a full page) that may be present in a document page.
Fig. 1. Examples of tabular structures (a) TOC-I; (b) TOC-II; (c) Index page and (d) Tabular math-zones.
Fig. 1 (a) to (d) show the examples of tabular structures other than the common tables.
3.1 Observation
TABLE: A table consists of at least 2 rows and 2 columns, and it may be fully or partly embedded in boxes formed by horizontal and vertical row and column separators. In its simplest form there must be a single horizontal line, possibly dividing the header row and the data rows of a table. A table may occupy a portion of the document page or may spread over the full page; more than one table in a page is also possible.
TOC: A TOC page may be of two types: i) TOC-I has right-aligned page numbers, which yield a hump at the right-most side of its vertical projection, and ii) TOC-II has non-aligned page numbers appearing just after the title field. TOC-II is characterised by minimum variation in the number of characters present in the last word (i.e. the page number) of each line. Examples of the two types of TOC pages that are commonly used are shown in Fig. 1(a) and (b).
INDEX: An index page is characterised by the presence of a number of words delimited by commas in a majority of text lines. An index page also has at least two columns. Note that an index page has an inherent logical tabular structure whose first field is the keyword and whose second field is a set of page numbers. A typical example is shown in Fig. 1(c).
MATH-ZONE: Displayed math with tabular structure contains one or more special characters/symbols (indicating the beginning or end of matrices, determinants, conditional equations, etc.) which are spread over multiple text lines, as shown in Fig. 1(d).
A tabular structure is characterised by the presence of distinct humps in the vertical projection of the components in a specified area of a document image. For a typical text area the vertical projection yields a single hump with moderate variations in the projection profile. Moreover, this hump will be spread over the whole width of the window being scanned. On the other hand we will
get multiple humps in the cases of i) tables, ii) tabular displayed math-zones, and iii) TOC and index pages. In our approach we mark anything that may be considered tabular. Note that this includes normal multi-column text pages, which can easily be excluded in the identification stage by checking the spread of a couple of humps separated by one or more short narrow stripes of white space.
3.2 Detection
Detection of a tabular structure is done by detecting and counting multiple humps in the vertical projection profile within a specified window moving over a portion (or the whole) of the document image. However, if we take the vertical projection directly on the document image then, due to the irregular pattern of the characters and words, hump detection will be a difficult job. It will also require some heuristics for correct detection of the hump span. The problem is alleviated by taking the vertical projection of solid clusters instead of the irregular original document page portion of the intended area. The clusters are formed from the original document page, and their formation is established next using conventional notation. Consider a binary image which consists of connected components as defined in standard texts [25], with their usual meanings. Let the four extreme points of a connected component in the four directions (i.e., left, right, top and bottom) be given. Suppose a function F guarantees that two connected components are in the same text line.
Note that for any three connected components, if the first and second are in the same text line and the second and third are in the same text line, then so are the first and third, i.e., the transitive property holds on the relation F for connected components. Cluster formation requires information on the inter-word gap. This may be obtained from the histogram H of the distance D between two consecutive connected components, where D computes the horizontal distance between any two consecutive connected components belonging to the same text line.
The histogram H therefore captures the gaps between consecutive characters. It may be noted that if there is only one font in the document then we will get two distinct humps in H: the first one for the character gap within a word and the second one for the word gap. If there is more than one font we will find
a few other humps; however, the first hump will be the most prominent, followed by the second. Our intention is to find the word gap in the normal text of a document page so that we can combine consecutive words into a single cluster. For this purpose we take the mean and standard deviation of the second hump, and a morphological closing operation with a structuring element whose size is derived from these values forms the clusters. Two connected components belong to the same cluster if they lie in the same text line (relation F) and the horizontal distance D between them does not exceed the estimated word gap.
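The clustering step can be sketched as follows (Python with OpenCV). The exact rule for sizing the structuring element from the second hump is not fully recoverable from the text, so taking its mean plus one standard deviation, and ignoring the per-text-line grouping of components, are simplifying assumptions of this sketch.

```python
import cv2
import numpy as np

def form_clusters(binary):
    """Merge the characters of a word into solid clusters by a morphological
    closing whose width is estimated from the gap histogram of consecutive
    connected components.

    binary: uint8 image with foreground (ink) = 255.  Components are ordered
    by their left coordinate only; the per-text-line grouping used in the
    paper is omitted for brevity.
    """
    n, _, stats, _ = cv2.connectedComponentsWithStats(binary, connectivity=8)
    spans = sorted((stats[i, cv2.CC_STAT_LEFT],
                    stats[i, cv2.CC_STAT_LEFT] + stats[i, cv2.CC_STAT_WIDTH])
                   for i in range(1, n))
    gaps = np.array([b[0] - a[1] for a, b in zip(spans, spans[1:]) if b[0] > a[1]])
    if gaps.size == 0:
        return binary.copy()
    split = gaps.mean()                      # crude split: character gaps vs word gaps
    word_gaps = gaps[gaps > split]
    width = int(word_gaps.mean() + word_gaps.std()) if word_gaps.size else int(split)
    se = cv2.getStructuringElement(cv2.MORPH_RECT, (max(width, 1), 1))
    return cv2.morphologyEx(binary, cv2.MORPH_CLOSE, se)
```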
To examine the hump structures a window is created, primarily around the horizontal line or lines present in the image, and moved in the up/down directions. Such an example is given in Fig. 2(a), showing the selected window on the original image at a particular position. It may be noted that the projection profile is not taken on the original image but on its clustered version. Fig. 2(b) shows the detected tabular portion of the same image.
Fig. 2. Example of the sliding window for table detection and the histogram required to detect math-zones: (a) sliding window around the horizontal line; (b) corresponding tabular area; (c) histogram of inter-character and word gaps for Fig. 1(d).
Detection of tabular math-zones is more complicated than finding the other textual tabular portions [5] and is elaborated here. First we create a histogram of the distance between two neighbouring components in a single text line, as shown in Fig. 2(c). Tabular math-zone detection needs a special type of window, created by detecting tall components (usually mathematical symbols like curly and square braces) which are spread over a couple of text lines. We replace the
tall component by an equivalent vertical line. A window in this case will be created with height equal to the height of the vertical line (see Fig. 3(a)) and extended to the right edge of the text region/column or another vertical line. Here the scanning will follow from the top to the bottom of the window.
Fig. 3. Tabular math-zone segmentation; (a) windows around math-zones of Fig. 1(d) using tall components/symbols; (b) vertical projections inside the windows; (c) cluster formation by morphological operation and (d) vertical projection of the clusters inside the windows.
Fig. 3(b) shows an example where the vertical projections inside the windows are taken directly from the original (non-clustered) image, while Fig. 3(c) and (d) show the cluster formation, obtained with the help of the histogram shown in Fig. 2(c), and the vertical projections on the clustered version, respectively. The need for clustering in hump detection is clearly evident from the figures. Finally, the detection steps are listed briefly below.
1. Separate horizontal lines by morphological opening operations using a horizontal line structuring element (SE).
2. Tall symbols/characters are next isolated. This is done by component labelling of the image; for each component we search for at least two text lines on either side by taking horizontal projection profiles. If we get multiple text lines on one side and the component is spread over those lines, then it is considered a tall component which may be part of displayed math in tabular form.
3. Clusters are formed through morphological operations on the original image.
4. For each horizontal line (obtained in step 1) a sliding window is created around it. The sliding windows are then moved up and down, and vertical projections on the clustered version are taken as long as the number of humps obtained from the initial projection is maintained. If the projection profile contains a number of humps we mark that area as tabular.
5. Another set of windows is formed around the tall symbols (obtained in step 2). It may be noted that the tall symbols may occur in pairs or alone. The vertical projection of the clustered version is taken inside the window. A number of humps inside any window mark that region as a tabular area.
6. If no window is created in steps 4 and 5, then a window is formed around the overall document and the corresponding vertical projection is taken on the clustered image.
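The hump-counting test at the core of steps 4 to 6 can be sketched as follows; the minimum hump width and the merging gap are illustrative parameters, not values taken from the paper.

```python
import numpy as np

def count_humps(cluster_window, min_width=5, min_gap=3):
    """Count humps in the vertical projection of a clustered window.

    A hump is a run of non-empty columns at least min_width wide; runs
    separated by fewer than min_gap empty columns are merged.  Both
    thresholds are illustrative, not values from the paper.
    """
    profile = (cluster_window > 0).sum(axis=0)        # vertical projection
    runs, start = [], None
    for x, v in enumerate(profile):
        if v > 0 and start is None:
            start = x
        elif v == 0 and start is not None:
            runs.append((start, x))
            start = None
    if start is not None:
        runs.append((start, len(profile)))
    merged = []
    for r in runs:
        if merged and r[0] - merged[-1][1] < min_gap:
            merged[-1] = (merged[-1][0], r[1])        # merge across a small gap
        else:
            merged.append(r)
    return sum(1 for a, b in merged if b - a >= min_width)

def is_tabular(cluster_window):
    """A window is marked tabular when its projection shows multiple humps."""
    return count_humps(cluster_window) >= 2
```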
3.3 Identification
The tabular areas are all individually checked to identify their types. The steps to identify the different types of tabular portions are elaborated below. For any tabular portion:
i) if it contains at least one horizontal line then it is a table;
ii) if it contains two vertical tall components at the left and right, or only on the left, then it is a tabular displayed math-zone, provided it has a multiple-column structure (note that a graphical component may have a vertical line but there will be no hump structure);
iii) if the tabular portion is spread over the full page then it is
a) a TOC page if we get a tall narrow hump at the extreme right, or a small narrow hump at the extreme left with minimum variation of the width of the last word in each line, OR
b) an index page if it contains a number of commas to the right of the index word with an uneven horizontal spread of text lines.
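These rules translate directly into a small decision procedure, sketched below; every feature name is a hypothetical placeholder for the corresponding test described above.

```python
def identify_tabular_region(region):
    """Classify a detected tabular region using the rules listed above.

    region is assumed to expose boolean features corresponding to the tests
    described in the text; all attribute names are hypothetical.
    """
    if region.has_horizontal_line:
        return "table"
    if region.has_tall_vertical_components and region.has_multiple_columns:
        return "tabular math-zone"
    if region.spans_full_page:
        if region.right_aligned_hump or region.last_word_width_is_uniform:
            return "TOC page"
        if region.many_commas_after_first_word and region.uneven_line_widths:
            return "index page"
    return "other"
```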
4 Experimental Results
Here we present results (see Table 1) on the detection and identification of tabular portions of document images. It may be noted that in this work we have given emphasis to a unified approach to the detection and identification of tabular structures; segmentation of the individual fields of TOC and index pages, etc., is not done. For the dataset we have used document pages from the University of Washington document image database (UW-I, II, III) and our own collection of TOC and index pages from text books. The dataset contains around 200 text pages (with and without tables) of different kinds, along with 76 pages with tabular math-zones, 143 TOC pages and 124 index pages. We could identify 120 of the TOC pages and all 124 index pages. The tests were run on a COMPAQ DS 20E server running Digital UNIX. All the programs are written in C, and the average time for the detection and identification of the tabular structures in a page is about 5 seconds, including the preprocessing operations.
5 Conclusion
We conclude this paper by pointing out the unique features of our work, which are: (i) segmentation and identification of any type of tabular structure present in the document, including normal tables, tabular displayed math-zones, table of contents (TOC) and index pages, without any character recognition step; (ii) low computation cost; and (iii) detection of TOC and index pages without any explicit training on the font size and style used in a page.
References
1. Liu, J., Wu, X.: Description and recognition of form and automated form data entry. In: Proc. Third Int. Conf. on Document Analysis and Recognition, ICDAR'95 (1995) 579–582
2. Joseph, S.H.: Processing of engineering line drawings for automatic input to CAD. Pattern Recognition, Vol. 22 (1989) 1–11
3. Katsura, E., Takasu, A., Hara, S., Aizawa, A.: Design considerations for capturing an electronic library. Information Services and Use (1992) 99–112
4. Satoh, S., Takasu, A., Katsura, E.: An automated generation of electronic library based on document image understanding. In: Proc. ICDAR 1995 (1995) 163–166
5. Baird, H.S.: Digital libraries and document image analysis. In: Proc. 7th International Conference on Document Image Analysis, Vol. I, Los Alamitos, California, IEEE Computer Society (2003) 2–14
6. Chandran, S., Balasubramanian, S., Gandhi, T., Prasad, A., Kasturi, R., Chhabra, A.: Structure recognition and information extraction from tabular documents. IJIST, Vol. 7 (1996) 289–303
7. Watanabe, T., Luo, Q.L., Sugie, N.: Layout recognition of multi-kinds of table-form documents. IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 17 (1995) 432–446
8. Das, A.K., Chanda, B.: Detection of tables and headings from document image: A morphological approach. In: Int. Conf. on Computational Linguistics, Speech and Document Processing (ICCLSDP'98), Feb. 18–20, Calcutta, India (1998) A57–A64
9. Tsuruoka, S., Takao, K., Tanaka, T., Yoshikawa, T., Shinogi, T.: Region segmentation for table image with unknown complex structure. In: Proc. of ICDAR'01 (2001) 709–713
10. Zuyev, K.: Table image segmentation. In: Proc. ICDAR'97, Ulm, Germany (1997) 705–707
11. Tanaka, T., Tsuruoka, S.: Table form document understanding using node classification method and HTML document generation. In: Proc. of 3rd IAPR Workshop on Document Analysis Systems (DAS'98), Nagano, Japan (1998) 157–158
12. Belaid, Y., Panchevre, J.L., Belaid, A.: Form analysis by neural classification of cells. In: Proc. of 3rd IAPR Workshop on Document Analysis Systems (DAS'98), Nagano, Japan (1998) 69–78
13. Tersteegen, W.T., Wenzel, C.: Scantab: Table recognition by reference tables. In: Proc. of Third IAPR Workshop on Document Analysis Systems (DAS'98), Nagano, Japan (1998) 356–365
14. Tsuruoka, S., Hirano, C., Yoshikawa, T., Shinogi, T.: Image-based structure analysis for a table of contents and conversion to XML documents. In: Workshop on Document Layout Interpretation and its Application (DLIA 2001), Sep. 9, 2001, Seattle, Washington, USA (2001)
15. Mandal, S., Chowdhury, S.P., Das, A.K., Chanda, B.: Automated detection and segmentation of table of contents page and index pages from document images. In: 12th International Conf. on Image Analysis and Processing (ICIAP'03), 17–19 Sept. 2003, Mantova, Italy (2003) 213–218
16. Belaid, A., Haton, J.P.: A syntactic approach for handwritten mathematical formula recognition. IEEE Trans. PAMI, Vol. 6 (1984) 105–111
17. Ha, J., Haralick, R.M., Phillips, I.T.: Understanding mathematical expressions from document images. In: Proc. of ICDAR'95, Canada (1995) 956–959
18. Fateman, R., Tokuyasu, T., Berman, B., Mitchell, N.: Optical character recognition and parsing of typeset mathematics. Visual Communication and Image Representation, Vol. 7, No. 1 (1996) 2–15
19. Kacem, A., Belaid, A., Ahmed, M.B.: Automatic extraction of printed mathematical formulas using fuzzy logic and propagation of context. IJDAR, Vol. 4, No. 2 (2001) 97–108
20. Toumit, J.Y., Garcia-Salicetti, S., Emptoz, H.: A hierarchical and recursive model of mathematical expressions for automatic reading of mathematical documents. In: Proc. of ICDAR'99, India (1999) 116–122
21. Chowdhury, S.P., Mandal, S., Das, A.K., Chanda, B.: Automated segmentation of math-zones from document images. In: 7th International Conference on Document Analysis and Recognition, Volume II, 3–6 August 2003, Edinburgh, U.K. (2003) 755–759
22. Das, A.K., Chanda, B.: Text segmentation from document images: A morphological approach. Journal of Institute of Engineers (I), Vol. 77, November (1996) 50–56
23. Otsu, N.: A threshold selection method from gray-level histogram. IEEE Trans. SMC, Vol. 9, No. 1 (1979) 62–66
24. Das, A.K., Chanda, B.: A fast algorithm for skew detection of document images using morphology. Intl. J. of Document Analysis and Recognition, Vol. 4 (2001) 109–114
25. Gonzalez, R.C., Wood, R.: Digital Image Processing. Addison-Wesley, Reading, Mass. (1992)
Underline Removal on Old Documents
João R. Caldas Pinto¹, Pedro Pina², Lourenço Bandeira¹, Luís Pimentel¹, and Mário Ramalho¹
¹ IDMEC, Instituto Superior Técnico, Av. Rovisco Pais, 1049-001 Lisboa, Portugal
{jcpinto,lpcbandeira,mar}@dem.ist.utl.pt
² CVRM / Geo-Systems Centre, Instituto Superior Técnico, Av. Rovisco Pais, 1049-001 Lisboa, Portugal
[email protected]
Abstract. In this paper we tackle the specific problem of handwritten underline removal. The presence of such underlines is very frequent even in rare books, and their removal is a desired goal of the national libraries in their process of building digital libraries. Two different techniques are applied and compared: one is based on mathematical morphology and the other on a recently published technique for line detection. Both were applied to books of the XVI century and the results were evaluated.
1 Introduction
The universal availability and the on-demand supply of digital duplicates of large amounts of written text are two inevitable paths of the future. To achieve this goal, books first have to be digitized. Due to the rarity of most of them, this operation should be carried out only once, so high resolution color images are obtained. These large images will be the future source data. The first natural possibility for the libraries is to make reduced versions of these images available on the Internet. However, size is not the only problem: the visual appearance of the books can also be very poor due to the aging process. Spots due to humidity, marks resulting from ink that goes through the paper, rubber stamps, pen strokes and underlines are features that, in general, are not desirable. On the other hand, it is an ultimate goal of the libraries to obtain ASCII versions of the books, which means performing optical character recognition (OCR). However, while this is a well-established process for images of texts written with modern fonts on clean paper, it is still a challenging problem for degraded old documents. Thus, all contributions that improve the quality of the images are important to the success of this process. The OCR problems that have to be faced when dealing with old documents have already been discussed in [1] and [2]. The main application relies on a commercial OCR package, the ABBYY FineReader Engine [3]. This package, besides its OCR output, provides methods for geometric image correction and binarization. A final binary image is obtained and the OCR operation is then performed. Fig. 1 displays a diagram representing schematically the organization of the developed system and the streamlined design connecting its components. Because we have access to the intermediate
binary image, it is our proposal to process this image outside FineReader and feed the engine with the processed image for the last step of the OCR process. This is also represented in Fig. 1. Indeed, we can observe that some undesirable features, like spots and underlines, are present in the binary image produced by FineReader. These underlines in most cases mislead the OCR about the real dimensions of the characters and words, leading to inaccurate recognition. Indeed, to perform automatic text recognition every character and word has to be isolated. Summarizing, the motivation to remove the underlines in a written text is twofold: to improve the general appearance of the book and to increase the accuracy of the OCR task.
Fig. 1. System overview diagram
In this paper we address the problem of underline removal in binary images. We chose to work with these images because we assume that other techniques have already been used to remove other degrading elements, like spots, and to perform any geometrical corrections needed due to the acquisition process, either using FineReader or other available algorithms. This is not a limitation, because it is always possible to use the extracted information on the underline positions to remove them from the original color images. When we refer to an underline, it is assumed to be handwritten, with all its characteristics of irregularity and variable slope (see the example of a page in Fig. 2, where we present the original grey level and binary images). Otherwise the problem would be trivial, given the effective line detection techniques present in the literature [4]. Our first approach to the proposed problem was to use mathematical morphology, due to its well-known capacity to deal with this type of problem [5]. We also decided to use a recently introduced and promising line detection technique [6] whose characteristics seemed to be easily adapted to more irregular straight lines. This paper is organized as follows. Section 2 describes the application of mathematical morphology to remove handwritten underlines. Section 3 presents a recently introduced technique (small eigenvalue) for line detection that the authors claim to be very robust and discusses its applicability to our problem. Section 4 describes and discusses the results obtained. Finally, Section 5 presents the conclusions and possible future work.
Fig. 2. Example of a printed page: (a) grey level and (b) after binarization.
2 Mathematical Morphology
Mathematical morphology is a theory created in the mid 1960s by Georges Matheron and Jean Serra with the objective of providing a tool to quantify the geometric nature of structures [5]. Its recognition as a powerful image analysis tool has been increasing since then, as testified by the ever wider range of fields where it is applied. In the document analysis field there are also some works that deserve to be referenced: image segmentation [7-10], removal of background spots [11] and separation of text from non-text (figures, capital letters, stripes) in XVI century printed books [12]. In the present context of underline removal there is no published or similar study using mathematical morphology. Our approach is based on the exploitation of the near horizontal disposition of these handwritten lines. It appeals to directional [13] but also to other morphological transforms [14]. In the first step, before proceeding to the line removal, it is necessary to create a mask that contains only the text, in order to filter out all the noise located outside that mask. This objective is achieved by applying a closing with an isotropic structuring element B to the binary image X, followed by an erosion (R) filtering; the resulting set Y is the text mask used in the sequel.
The handwritten underlines are marked by applying directional openings with a segment or straight line l as structuring element in the horizontal direction. Only horizontal or sub-horizontal structures can resist, totally or partially, this transform. The partial directional reconstruction DR of the remaining regions permits the recovery of the underlines: the geodesic dilation uses a directional structuring element and is applied until idempotence is reached. The set difference with the image Y permits the filtering out of the handwritten underlines. In order to recover the regions of the letters suppressed by the elimination of the handwritten underlines (in these regions there is a superimposition between letters and underlines), a dilation in the vertical direction using a segment as structuring element is applied. It gives the possibility of partially recovering these common regions without reconstructing the handwritten structures again.
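The pipeline above can be sketched as follows (Python with OpenCV); the structuring-element sizes are illustrative, the initial text-mask filtering is omitted, and the directional geodesic reconstruction is implemented as iterated masked dilation, one standard way of reaching idempotence.

```python
import cv2
import numpy as np

def remove_underlines_mm(binary, char_width=30, recover_height=5):
    """Sketch of the morphological underline removal described above.

    binary: uint8 image with ink = 255.  char_width (a little wider than the
    largest character) and recover_height are illustrative values; the initial
    text-mask filtering (closing followed by erosion) is omitted for brevity.
    """
    # Directional opening: only (sub-)horizontal structures survive.
    horiz_se = cv2.getStructuringElement(cv2.MORPH_RECT, (char_width, 1))
    marker = cv2.morphologyEx(binary, cv2.MORPH_OPEN, horiz_se)

    # Partial directional reconstruction of the underlines: dilate the marker
    # horizontally, clip by the original image, iterate until idempotence.
    dil_se = cv2.getStructuringElement(cv2.MORPH_RECT, (3, 1))
    prev = None
    while prev is None or not np.array_equal(marker, prev):
        prev = marker
        marker = cv2.min(cv2.dilate(marker, dil_se), binary)
    underlines = marker

    # Remove the underlines, then recover the letter strokes that were cut by
    # dilating the remaining text vertically inside the removed area.
    text = cv2.subtract(binary, underlines)
    vert_se = cv2.getStructuringElement(cv2.MORPH_RECT, (1, recover_height))
    recovered = cv2.min(cv2.dilate(text, vert_se), underlines)
    return cv2.bitwise_or(text, recovered)
```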
3 Line Detection Based on Small Eigenvalues
Guru et al. [5] introduced a novel algorithm for transforming a given edge image into an image in which only linear edge pixels are enhanced, in order to extract the straight lines present in the edge image. The detection of straight lines is based on the correspondence of certain geometrical properties with the size of the small eigenvalue of the covariance matrix. For an object with a symmetric shape like a straight segment, the computed eigenvalue is minimal. Given an edge image, a window of size k×k scans the entire image pixel by pixel for mask processing. When the selected central pixel is an edge point, all the points connected to it are recursively determined, using a 3×3 window positioned on the pixels belonging to the k×k mask. Once the positions of these m connected points on the image are registered, and granted the existence of a minimum number of connected points (so that the center point is not noise), the covariance matrix is constructed from the positions relative to the center point of the mask by the following expressions:
c11 = (1/m) Σ (xi - x̄)²,   c22 = (1/m) Σ (yi - ȳ)²,   c12 = c21 = (1/m) Σ (xi - x̄)(yi - ȳ),
where x̄ and ȳ are respectively the mean values of the x and y coordinates of the m connected points. The smaller eigenvalue λ associated with the set of pixels selected by the mask is given by
λ = [ (c11 + c22) - √( (c11 - c22)² + 4 c12² ) ] / 2.
For each point previously registered, the presently computed λ is compared to the existing value in the respective position of the small eigenvalue matrix, a matrix with
the size of the input image, initialized at the beginning of the process with a high value. If the computed λ is smaller, the value in memory is updated; otherwise it is maintained. Once all the points selected by the mask have been processed, the center of the mask is moved to the next pixel, and this process is repeated until the center reaches the last pixel of the input image. The output image is finally obtained by defining the maximum value of λ that a pixel represented in the small eigenvalue matrix can have in order to be considered as belonging to a straight line.
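A sketch of the enhancement step is given below. The closed-form smaller eigenvalue of the 2×2 covariance matrix is standard linear algebra; gathering all foreground pixels of the k×k mask instead of the recursively connected ones, and updating only the centre pixel, are simplifications of this sketch rather than the exact method of [5].

```python
import numpy as np

def small_eigenvalue_map(edge, k=7, min_points=5):
    """For every edge pixel, gather the edge points of the k x k mask centred
    on it, build their 2x2 covariance matrix and record its smaller eigenvalue;
    small values indicate locally linear structures.

    Simplifications with respect to [5]: all edge pixels of the mask are used
    instead of the recursively connected ones, and only the centre pixel of
    each mask is updated.
    """
    h, w = edge.shape
    r = k // 2
    ev = np.full((h, w), np.inf)
    ys, xs = np.nonzero(edge)
    for y, x in zip(ys, xs):
        win = edge[max(0, y - r):y + r + 1, max(0, x - r):x + r + 1]
        py, px = np.nonzero(win)
        if py.size < min_points:             # centre point treated as noise
            continue
        dx, dy = px - px.mean(), py - py.mean()
        c11, c22, c12 = (dx * dx).mean(), (dy * dy).mean(), (dx * dy).mean()
        ev[y, x] = 0.5 * ((c11 + c22) - np.sqrt((c11 - c22) ** 2 + 4.0 * c12 ** 2))
    return ev

def linear_pixels(edge, k=7, max_lambda=0.5):
    """Keep as straight-line candidates the pixels whose small eigenvalue stays
    below a threshold (the threshold value is illustrative)."""
    return small_eigenvalue_map(edge, k) <= max_lambda
```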
4 Results
In order to test both of the proposed techniques, mathematical morphology (MM) and small eigenvalue (SEV), several experiments were conducted. The performance of each technique was evaluated by visually inspecting the preprocessed word images and also by comparing the OCR results returned by FineReader with those obtained with its standard binarization procedure (FRB).
Fig. 3. Underline removal results using: (a) Mathematical Morphology and (b) Small Eigenvalue Approaches.
An overall inspection of the images obtained permits the conclusion that both methods are very satisfactory, since most of the handwritten underlines are removed. An example that illustrates the outputs of both approaches is presented in Fig. 3 for the same book page. Nevertheless, there are differences in the visual appearance of the two approaches that must be pointed out. The MM approach normally produces cleaner images with a better visual aspect. However, in some situations there is over-filtering, where some smaller structures are suppressed. On the contrary, the SEV approach leads to less clean images, but still with correct results, where the "damage" introduced is less severe than in the MM case.
The MM approach assumes that the underlines are near-horizontal to horizontal structures. In addition, the underlines must be wider than the largest character in order to be detected. The application of directional openings in the horizontal direction suppresses the regions not presenting this orientation. The choice of the size of the segment (structuring element) is linked to the dimension of the fonts of the text, as it has to be a little wider than the largest character. In some situations adjacent characters are connected and can produce larger segments that are able to resist the previous opening transform and that appear, in a first stage, in the underline set. Thus these regions, although containing segments larger than the dimension of the structuring element, are incorrectly removed from the initial image, from the point of view of the objective pursued. This approach can be considered robust, with good results, but presents a drawback related to the suppression of small structures that are valuable information in the text, like the dots of the "i" and "j" letters. Although only in a few cases does this situation produce wrong character recognitions, it is an issue to be improved in the future. On the other hand, the application of the SEV algorithm to the selection of the underlines of an alphanumeric text reveals certain defects attached to the fact that some letters (e.g. "i", "I", "f", "F", "d", "D") have the appearance of linear edge pixels and that others (e.g. "O", "o", "G") exhibit symmetric properties. To prevent the elimination of points of the text within the underlines, the output image obtained must be scanned again, this time by a rectangular window (e.g., n×5n). If the number of points connected to the center of this mask is bigger than an established value (e.g., 10n), the set of points is automatically classified as underline and not text. In the opposite case it must be verified whether, a few pixels to the left or right in the output image, there were detected points belonging to linear segments, ensuring that small traces due to the consecutive interception of slender letters are also classified as underlines. Once fine parameters k and n are found for the printing types and underlines, the algorithm proves to be extremely effective at preserving the original text. Its effectiveness is maximal for thin underlines (a single stroke), presenting promising results. The detection of thicker underlines, only suitable for larger types too, requires greater values of k and n, naturally increasing the computational time. A solution to accelerate the process may reside in reducing the one-pixel overlapping scan of the different windows resulting from the pixel-by-pixel scanning. Several examples illustrating the application of both approaches (MM and SEV), as well as of the standard FRB approach, are presented in Table 1. The images presented for some local regions of the text reinforce what has been said before. Also presented in the same table are the results obtained for each image after the application of the OCR. In addition, a sample page was submitted to FineReader before and after the underline removal algorithms were applied. The outcome is summarized in Table 2, for a test set of 197 words. The main conclusion is that the results obtained by either of the introduced approaches are always better than those of the standard approach.
5 Conclusions
Removing handwritten underlines from old documents is an important issue in two different applications: improving the visual appearance of degraded documents in
order to make them available on the Internet, and contributing to a better OCR performance when applied to old documents, a field which still poses other challenging problems.
Straight underlines, generally referred to simply as lines, can easily be removed by a set of well-known techniques, which however fail in most handwritten cases. Thus, the adaptation in this paper of two techniques to handwritten underlines leads to more robust methods and produces better results than the standard procedure. As future work, we intend to improve both approaches individually in order to suppress their present drawbacks: the filtering by MM has to be less destructive of some character features, and SEV has to be more effective in underline detection and removal. The integration of both approaches, incorporating their most positive aspects, is also a major challenge we intend to pursue.
Acknowledgements. This work was partly supported by: the "Programa de Financiamento Plurianual de Unidades de I&D (POCTI), do Quadro Comunitário de Apoio III"; by program FEDER; by the FCT project POSI/SRI/41201/2001; and by the "Programa do FSE-UE, PRODEP III, no âmbito do III Quadro Comunitário de Apoio". We also acknowledge the continuous support of BN-Biblioteca Nacional and the permission to access large sets of documents.
References
1. Ribeiro C.S., Gil J.M., Caldas Pinto J.R., Sousa J.M.: Ancient document recognition using fuzzy methods. In: Proceedings of the 4th International Workshop on Pattern Recognition in Information Systems, Porto, Portugal (2004) 98–107
2. Ribeiro C.S., Gil J.M., Caldas Pinto J.R., Sousa J.M.: Ancient word indexing using fuzzy methods. In: Proceedings of the 4th International Workshop on Pattern Recognition in Information Systems, Porto, Portugal (2004) 210–215
3. ABBYY FineReader Homepage, http://www.abbyy.com, ABBYY Software House
4. Shapiro L.G., Stockman G.C.: Computer Vision. Prentice Hall, New Jersey (2001)
5. Guru D.S., Shekar B.H., Nagabhushan P.: A simple and robust line detection algorithm based on small eigenvalue analysis. Pattern Recognition Letters 25(1) (2004) 1–13
6. Serra J.: Image Analysis and Mathematical Morphology. Academic Press, London (1982)
7. Agam G., Dinstein I.: Adaptive directional morphology with application to document analysis. In: Maragos, P., Schafer, R.W., Butt, M.A. (eds.), Mathematical Morphology and its Applications to Image and Signal Processing. Kluwer Academic Publishers, Boston (1996) 401–408
8. Cumplido M., Montolio P., Gasull A.: Morphological preprocessing and binarization for OCR systems. In: Maragos, P., Schafer, R.W., Butt, M.A. (eds.), Mathematical Morphology and its Applications to Image and Signal Processing. Kluwer Academic Publishers, Boston (1996) 393–400
9. Hasan Y., Karam L.: Morphological text extraction from images. IEEE Transactions on Image Processing 9(11) (2000) 1978–1982
10. Mengucci M., Granado I.: Morphological segmentation of text and figures in Renaissance books (XVI Century). In: Goutsias J., Vincent L., Bloomberg D.S. (eds.), Mathematical Morphology and its Applications to Image and Signal Processing. Kluwer Academic Publishers, Boston (2000) 397–404
11. Beucher S., Kozyrev S., Gorokhovik D.: Pré-traitement morphologique d'images de plis postaux. In: Actes CNED'96 Colloque National Sur L'Écrit et le Document, Nantes, France (1996) 133–140
12. Muge F., Granado I., Mengucci M., Pina P., Ramos V., Sirakov N., Caldas Pinto J.R., Marcolino A., Ramalho M., Vieira P., Amaral A.M.: Automatic feature extraction and recognition for digital access of books of the Renaissance. In: Borbinha J., Baker Th. (eds.), Research and Advanced Technology for Digital Libraries, Lecture Notes in Computer Science, Vol. 1923. Springer, Berlin (2001) 1–13
13. Soille P., Talbot H.: Directional Morphological Filtering. IEEE Transactions on Pattern Analysis and Machine Intelligence 23(11) (2001) 1313–1329
14. Soille P.: Morphological Image Analysis. 2nd edition. Springer, Berlin (2003)
A New Algorithm for Skew Detection in Images of Documents
Rafael Dueire Lins and Bruno Tenório Ávila
Universidade Federal de Pernambuco, Recife - PE, Brazil
{rdl, bta}@ee.ufpe.br
Abstract. Very frequently the digitalisation process of documents produces images rotated by small angles in relation to the original image axes. The skew introduced makes the visualisation of images by human users more difficult. Besides that, it increases the complexity of any sort of automatic image recognition, degrades the performance of OCR tools, increases the space needed for image storage, etc. Thus, skew correction is an important part of any document processing system and has been a matter of concern for researchers for almost two decades now. The search for faster and good quality solutions to this problem is still on. This paper presents an efficient algorithm for skew detection and correction of images of documents including non-textual graphical elements, such as pictures and tables. The new algorithm was tested on over 10,000 images, yielding satisfactory performance. Keywords: Document Image Analysis, Skew detection, Rotated Images.
1 Introduction
Organisations are moving at a fast pace from paper to electronic documents. However, large amounts of paper documents inherited from a recent past are still needed. Digitalisation of documents appears as a bridge over the gap between past and present technologies, organizing, indexing, storing, retrieving directly or making accessible through networks, and keeping the contents of documents for future generations. Scanners tend to be of widespread use for the digitalisation of documents today. One of the important problems in this field is that very often documents are not correctly placed on the flatbed scanner, either manually by operators or by the automatic feeding device. This very frequent problem yields rotated images. For humans, rotated images are unpleasant for visualisation and introduce extra difficulty in text reading. For machine processing, image skew brings a number of problems that range from needing extra space for storage to making the recognition and transcription of the image by automatic OCR tools more error prone 15. These reasons make skew correction a commonplace phase in any environment for document processing. This problem has been addressed by researchers for over two decades
now, but it remains a matter of interest today to find faster and more accurate solutions to it. The very comprehensive review by Cattoni and his colleagues 6 shows that skew estimation algorithms may be classified into eight different classes, according to the basic approach they adopt, as shown in Table 1.
All the methods referenced above work on monochromatic input images and share the underlying assumption that the documents contain text arranged along parallel straight lines and that text represents most of the document image; the performance of most algorithms often decays in the presence of other components such as graphics or pictures. Furthermore, most of the skew detection schemes assume documents with a clearly dominant skew angle, and only a few methods can deal with documents containing multiple skews (the very recent paper 24 addresses this problem in the context of correcting the image produced by an uneven feeding device to a scanner, for instance). This paper presents a new algorithm for skew detection that works with complex documents, with a clear dominant skew direction between –45° and +45°, any image resolution, and user-defined accuracy up to resolution limitations. The proposed algorithm does not fit neatly into any of the classes presented in Table 1. At most, one could say that it bears a vague resemblance to cross-correlation algorithms.
2 The New Algorithm
Although the algorithm proposed herein may work on greyscale and colour documents, for the sake of simplicity it is assumed that documents are monochromatic. The basic idea of the new algorithm is to focus the analysis on the leftmost black points of the document. The average of their horizontal coordinates provides the axis taken as basis for rotation. The horizontal distance of the leftmost points to the reference axis (deviation) is used to calculate the rotation angle to be applied. The rotated image undergoes the same process until a termination condition on the rotation error is met. In what follows the algorithm is detailed.
Fig. 1. Example of rotated image
Fig. 2. Image with selected points
The first step of the algorithm scans the image top-down looking for the leftmost black points in the image. The points selected are shown in Figure 2. The distance between the leftmost point and the other selected points is measured. A depth parameter is used to discard points that do not belong to the leftmost border of the document, but lie on rotated horizontal lines (Figure 3). The average abscissa of the remaining leftmost points, called A, is calculated. Point A determines the axis taken as basis for rotation correction (Figure 4).
Fig. 3. Points selected by depth parameters
Fig. 4. Calculus of rotation axis A
At this step the ten percent leftmost points are selected and their average abscissa is calculated (Figure 5). The horizontal distance from this average to A is taken as the tangent of the rotation angle. The position of this average point in relation to the middle point of the axis A determines whether the rotation is clockwise (if it lies in the superior half of A) or anti-clockwise (if it lies in the inferior half of A).
Fig. 5. Calculus of rotation angle and direction
The image formed only by the original leftmost points is rotated by the calculated angle θ using the standard rotation transformation x′ = x cos θ − y sin θ, y′ = x sin θ + y cos θ. Notice that using the image of leftmost points not only saves processing time, as fewer points are rotated, but also preserves image quality by avoiding the degradation introduced by successive transformations.
The new image is processed iteratively until the newly calculated rotation angle is less than the error accepted by the application. This parameter also depends on the image resolution. Another possibility is to apply the algorithm until it starts to diverge (the trend of the algorithm is to oscillate around a minimal point). The sum of all successive rotation angles yields the total rotation angle to be applied to the original image, providing the needed skew correction.
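To make the iteration concrete, the sketch below gives one possible reading of the procedure in Python. It is an illustration, not the authors' C implementation: the depth filter, the normalisation of the tangent, the termination tolerance and the coordinate conventions are choices made for this example.

import numpy as np

def estimate_skew(black, depth=20, tol=0.05, max_iter=14):
    """Iterative skew estimation for a monochromatic document image.

    black : 2D boolean array, True where the image has an ink (black) pixel.
    Returns the accumulated rotation angle in degrees (illustrative sketch).
    """
    h, w = black.shape
    # Top-down scan: leftmost black pixel of every row that contains ink.
    leftmost = np.full(h, w)
    rows, cols = np.nonzero(black)
    np.minimum.at(leftmost, rows, cols)
    ok = leftmost < w
    ys = np.nonzero(ok)[0].astype(float)        # vertical coordinates
    xs = leftmost[ok].astype(float)             # horizontal coordinates

    total = 0.0
    for _ in range(max_iter):
        keep = xs <= xs.min() + depth           # depth filter (cf. Fig. 3)
        xk, yk = xs[keep], ys[keep]
        axis = xk.mean()                        # rotation axis A (cf. Fig. 4)
        n10 = max(1, int(0.1 * xk.size))        # the ten percent leftmost points
        sel = np.argsort(xk)[:n10]
        dx = axis - xk[sel].mean()              # horizontal deviation from A
        dy = yk.mean() - yk[sel].mean()         # above or below the middle of A
        if abs(dy) < 1e-6:
            break
        angle = np.degrees(np.arctan(dx / dy))  # deviation taken as tangent
        if abs(angle) < tol:                    # termination condition
            break
        total += angle
        # Rotate only the leftmost points; the full image is rotated once,
        # by the accumulated angle, only at the very end.
        t = np.radians(angle)
        cx, cy = xs.mean(), ys.mean()
        xs, ys = (cx + (xs - cx) * np.cos(t) - (ys - cy) * np.sin(t),
                  cy + (xs - cx) * np.sin(t) + (ys - cy) * np.cos(t))
    return total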
3 Performance
The algorithm presented herein was tested using images such as the ones presented in Figures 6, 7, 8, and 9. Altogether, 200 page images were used to benchmark skew detection quality and space-time performance. These images are different pages extracted from articles on skew detection and correction in the literature, all of them referenced in this paper. The test images were rotated in both directions (clockwise and counter-clockwise) from 0.1° to 0.9° in steps of 0.1°, from 1° to 9° in steps of 1°, and finally from 10° to 45° in steps of 5°. This makes 10,400 test images in the benchmark set used. The skew detection algorithm was implemented in C and run on an Intel Pentium IV, 2.4 GHz clock, 512 MB RAM and an IDE hard disk. Image files were in TIFF format. No image took longer than 30 ms to have its skew detected. The average processing time for skew detection over the set of benchmark images was 9 ms in a left-to-right scan and 8 ms in a bottom-up scan. The average number of iterations was 4, with a maximum of 14 iterations for both scan orientations. The minimum error obtained by the algorithm was 0.004 degrees. Some of the tested images with rotation angles greater than 20° produced unsatisfactory results at first, with only one iteration (the initial one). The threshold depth of the algorithm was widened, allowing deeper analysis into the
document image. This increases the number of points to handle and the computational effort involved in the skew calculation. In these cases, increasing the threshold depth from 20 pixels to 100 or 200 pixels yielded acceptable skew recognition.
Fig. 6. Image with pictures
Fig. 7. Multi skew documents
4 Conclusions and Lines for Further Work
This paper presents a new algorithm for skew detection that works with complex documents, with a clear dominant skew direction between –45° and +45°, any image resolution, and user-defined accuracy up to resolution limitations. It is able to handle monochromatic images with non-textual elements (figures, graphs, tables, etc.). It can be easily adapted to work with greyscale and colour images. The algorithm presented good space-time performance figures, as skew calculations are performed on a small subset of points selected from the original image. The choice of a bottom-up scan has proved to be more time efficient and to yield more precise results than the left-to-right scan used to describe the algorithm herein. The proposed algorithm is being tuned for better accuracy of skew detection. The performance figures obtained so far are extremely encouraging. The choice of the points taken as reference for the skew correction is crucial. We are currently analysing documents to fine-tune the parameters of the algorithm. One of the strategies adopted is to analyse the mode and variance of the points under consideration. The algorithm presented here only reduces the variance. In the fine-tuning, after the variance is
minimised, a new step starts in which the mode of the chosen points is calculated. The chosen points are then rotated in steps of 0.1 degrees until reaching the maximum number of points at the mode. The use of component labelling 21 as a way to select the points to fine-tune the algorithm is under consideration. A few anomalous cases found in images rotated by angles greater than 20° were observed in the left-to-right scan of documents. These are also being studied. The results obtained in the fine-tuning of the algorithm presented here are reported in reference 14. Acknowledgements. The research reported herein was partly sponsored by CNPq (Conselho Nacional de Desenvolvimento Científico e Tecnológico) of the Brazilian Government, to whom the authors express their gratitude.
References
1. H.K. Aghajan, B.H. Khalaj and T. Kailath. Estimation of skew angle in text-image analysis by SLIDE: subspace-based line detection. Machine Vision & Applications, 7:267-276, 1994.
2. T. Akiyama and N. Hagita. Automated entry system for printed documents. Pattern Recognition, 23(11): 1141-1154, 1990.
3. A. Bagdanov and J. Kanai. Projection profile based skew estimation algorithm for JBIG compressed images. 4th Int. Conf. Document Analysis and Recognition, pp 401-405, Aug. 1997.
4. H.S. Baird. The skew angle of printed documents. Proc. Conf. Society of Photographic Scientists and Engineers, vol. 40, pp 21-24, 1987.
5. D.S. Bloomberg, G.E. Kopec and L. Dasani. Measuring document image skew and orientation. Proceedings of IS&T/SPIE EI'95, pp 302-316, February 1995.
6. R. Cattoni, T. Coianiz, S. Messelodi and C.M. Modena. Geometric Layout Analysis Techniques for Document Image Understanding: a Review. ITC-IRST, Jan. 1998.
7. S. Chen and R.M. Haralick. An automatic algorithm for text skew estimation in document images using recursive morphological transforms. Proc. IEEE Int. Conf. on Image Processing, pp 139-143, 1994.
8. G. Ciardiello, G. Scafuro, M.T. Degrandi, M.R. Spada and M.P. Roccotelli. An experimental system for office document handling and text recognition. Proc. 9th International Conference on Pattern Recognition, vol. 2, pp 739-743, 1988.
9. B. Gatos, N. Papamarkos and C. Chamzas. Skew detection and text line position determination in digitized documents. Pattern Recognition, 30(9): 1505-1519, 1997.
10. A. Hashizume, P.S. Yeh and A. Rosenfeld. A method of detecting the orientation of aligned components. Pattern Recognition Letters, 4:125-132, 1986.
11. S. Hinds, J. Fisher and D. D'Amato. A document skew detection method using run-length encoding and the Hough Transform. Proc. 10th International Conference on Pattern Recognition, pp 464-468, June 1990.
12. Y. Ishitani. Document skew detection based on local region complexity. Proc. of the 2nd Int. Conf. on Document Analysis and Recognition, pp 49-52, Oct. 1993.
13. D.S. Le, G.R. Thoma and H. Wechsler. Automated page orientation and skew angle detection for binary document images. Pattern Recognition, 27(10): 1325-1344, 1994.
14. R.D. Lins and B.T. Ávila. Fast skew detection in images of documents. In preparation.
15. C.A.B. Mello and R.D. Lins. A Comparative Study on OCR Tools. Vision Interface 99, pp 700-704, May 1999.
16. Y. Min, S.B. Cho and Y. Lee. A data reduction method for efficient document skew estimation based on Hough Transformation. Proc. 13th International Conference on Pattern Recognition, pp 732-736, August 1996.
17. L. O'Gorman. The document spectrum for page layout analysis. IEEE Trans. on Pattern Analysis and Machine Intelligence, 15(11): 1162-1173, 1993.
18. U. Pal and B.B. Chaudhuri. An improved document skew angle estimation technique. Pattern Recognition Letters, 17(8): 899-904, July 1996.
19. W. Postl. Detection of linear oblique structures and skew scan in digitized documents. Proc. of the 8th Int. Conf. on Pattern Recognition, pp 687-689, 1986.
20. J. Sauvola and M. Pietikäinen. Skew angle detection using texture direction analysis. Proc. 9th Scandinavian Conf. Image Analysis, pp 1099-1106, June 1995.
21. L.G. Shapiro and G.C. Stockman. Computer Vision, March 2000. http://www.cse.msu.edu/~stockman/Book/book.html.
22. R. Smith. A simple and efficient skew detection algorithm via text row accumulation. 3rd International Conference on Document Analysis and Recognition, pp 1145-1148, Aug. 1995.
23. A.L. Spitz. Skew determination in CCITT Group 4 compressed document images. Proceedings of Symposium on Document Analysis and Information Retrieval, pp 11-25, 1992.
24. A.L. Spitz. Correcting for variable skew in document images. International Journal on Document Analysis and Recognition, vol 6: 192-200, 2004.
25. S.N. Srihari and V. Govindaraju. Analysis of textual images using the Hough Transform. Machine Vision and Applications, 2(3): 141-153, 1989.
26. C. Sun and D. Si. Skew and slant correction for document images using gradient direction. 4th International Conference on Document Analysis and Recognition, pp 142-146, Aug. 1997.
27. H. Yan. Skew correction of document images using interline cross-correlation. CVGIP: Graphical Models and Image Processing, 55(6): 538-543, Nov. 1993.
28. B. Yu and A.K. Jain. A robust and fast skew detection algorithm for generic documents. Pattern Recognition, 29(10): 1599-1629, 1996.
Blind Source Separation Techniques for Detecting Hidden Texts and Textures in Document Images Anna Tonazzini, Emanuele Salerno, Matteo Mochi, and Luigi Bedini * Istituto di Scienza e Tecnologie dell'Informazione - CNR Via G. Moruzzi, 1, I-56124 PISA, Italy
* This work has been supported by the European Commission project "Isyreadet" (http://www.isyreadet.net), under contract IST-1999-57462.
[email protected]
Abstract. Blind Source Separation techniques, based both on Independent Component Analysis and on second-order statistics, are presented and compared for extracting partially hidden texts and textures in document images. Barely perceivable features may occur, for instance, in ancient documents previously erased and then re-written (palimpsests), or arise from transparency or seeping of ink from the reverse side, or from watermarks in the paper. Detecting these features can be of great importance to scholars and historians. In our approach, the document is modeled as the superposition of a number of source patterns, and a simplified linear mixture model is introduced to describe the relationship between these sources and multispectral views of the document itself. The problem of detecting the patterns that are barely perceivable in the visible color image is thus formulated as one of separating the various patterns in the mixtures. Some examples from an extensive experimentation with real ancient documents are shown and commented upon.
1 Introduction
Revealing the whole contents of ancient documents is an important aid to scholars who are interested in dating the documents or establishing their origin, or in reading older and historically relevant writings they may contain. However, interesting document features are often hidden or barely detectable in the original color document. Multispectral acquisitions in the non-visible range, such as the ultraviolet or the near infrared, constitute a valuable help in this respect. For instance, a method to reveal paper watermarks is to record an infrared image of the paper using transmitted illumination. Nevertheless, the watermark detected with this method is usually very faint and overlapped with the contents of the paper surface. To make the watermark pattern, or whatever feature of interest, more readable and free from interferences due to overlapped patterns, an intuitive strategy is to process, for instance by arithmetic operations, multiple "views" of the document. In the case where a color scan is available, three different views can be obtained from the red, green, and blue image channels. When
available, scans at non-visible wavelengths can be used alone or in conjunction with the visible ones. By processing the different color components, it is possible to extract some of the overlapped patterns, and, sometimes, even to achieve a complete separation of all of them. Indeed, since all these color components contain the patterns in different "percentages", simple difference operations between the colors, after suitable regulation of the levels, can "cancel" one pattern and enhance the other. For the case of watermarks, another infrared image taken using only the reflected illumination can be used for this purpose [1]. On the other hand, some authors claim that subtracting the green channel from the red is able to reveal hidden characters in charred documents [9]. These are, however, empirical, document-dependent strategies. We are looking, instead, for automatic, mathematically based techniques that are able to enhance or even to extract the hidden features of interest from documents of any kind, without the need for adaptations to the specific problem at hand. Our approach to this problem is to model all the document views as linear combinations of a number of independent patterns. The solution then consists in trying to invert this transformation. The overlapped patterns are usually the main foreground text, the background pattern, i.e. an image of the paper (or parchment, or whatever) support, which can contain different interfering features, such as stains, watermarks, etc., and one or more extra texts or drawings, due to previously written and then erased texts (palimpsests), seeping of ink from the reverse side (bleed-through), transparency from other pages (show-through), and other phenomena. Although our linear image model roughly simplifies the physical nature of overlapped patterns in documents [8], it has already proved to give interesting results. Indeed, this model has been proposed in [4] to extract the hidden texts from color images of palimpsests, assuming that the mixture coefficients are evaluated by visual inspection. Nevertheless, in general, the mixture coefficients are not known, and the separation problem becomes one of blind source separation (BSS). It has been shown that an effective solution to BSS can be found if the source patterns are mutually independent. The independence assumption gives rise to separation techniques based on independent component analysis, or ICA [6]. Although the linear data model is somewhat simplified, and the independence assumption is not always justified, we have already proposed ICA techniques for document image processing [10], and obtained good results with real manuscript documents. In this paper we compare the performance of ICA techniques with simpler methods that only try to decorrelate the observed data. As is known, this requirement is weaker than independence, and, in principle, no source separation can be obtained by only constraining second-order statistics, at least if no additional requirement is satisfied. However, our present aim is the enhancement of the overlapped patterns, especially of those that are hidden or barely detectable, and we experimentally found that this can be achieved in most cases even by simple decorrelation. On the other hand, while the color components of an image are usually spatially correlated, the individual classes or patterns that compose the image are at least less correlated.
Thus, decorrelating the color components gives a different representation where the now uncorrelated components of the image could coincide with the single classes. Furthermore, the second-order approach
is always less expensive than ICA algorithms, and, due to the poor modeling or to the lack of independence of the patterns, the results from decorrelation can even be better than those from ICA.
2 Formulation of the Problem
Let us assume that each pixel (of index t in a total of T) of a multispectral scan of a document has a vector value x(t) of N components. Similarly, let us assume to have M superimposed sources represented, at each pixel t, by the vector s(t). Since we consider images of documents containing homogeneous texts or drawings, we can also reasonably assume that the color of each source is almost uniform, i.e., we will have a mean reflectance index for each source at each acquisition wavelength; these indices are the entries of the mixing matrix introduced below. Thus, we will have a collection of T samples from a random N-vector x, which is generated by linearly and instantaneously mixing the components of a random M-vector s through an N × M mixing matrix A:

x(t) = A s(t)    (1)
where the source functions s_j(t) denote the "quantity" of the M patterns that concur to form the color at point t. Estimating s(t) and A from knowledge of x(t) is called a problem of blind source separation (BSS). In this application, we assume that noise and blur can be neglected. When only the visible color scan is available, vector x(t) has dimension N = 3 (it is composed of the red, green, and blue channels). However, most documents can be seen as the superposition of only three (M = 3) different sources, or classes, which we will call "background", "main text" and "interfering texture". In general, by using multispectral/hyperspectral sensors, the "color" vector can assume a dimension greater than 3. Likewise, we can also have M > 3 if additional patterns are present in the original document. In this paper, we only consider the case M = N, that is, the same number of sources as of observations, although in principle there is no difference with the general case. It is easy to see that this model does not perfectly account for the phenomenon of interfering texts in documents, which derives from complicated chemical processes of ink diffusion and paper absorption. Just to mention one aspect, in the pixels where two texts are superimposed on each other, the resulting color is not the vector sum of the colors of the two components, but is likely to be some nonlinear combination of them. In [8], a nonlinear model is derived even for the simpler phenomenon of show-through. However, although the linear model is only a rough approximation, it has demonstrated its usefulness in different applications, as already mentioned above [4][10].
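As a minimal numerical illustration of model (1), the fragment below assembles the observation matrix from a colour scan. It assumes numpy and an RGB image already loaded as an H × W × 3 array; the function name and the zero-mean normalisation are choices made for this sketch.

import numpy as np

def build_observations(rgb):
    """Stack the N = 3 colour channels of an H x W x 3 image into the
    observation matrix X of model (1), x(t) = A s(t), one column per pixel
    (t = 1..T, with T = H * W)."""
    h, w, n = rgb.shape
    X = rgb.reshape(h * w, n).T.astype(float)   # shape (N, T)
    X -= X.mean(axis=1, keepdims=True)          # zero-mean data, as assumed below
    return X

Each column of X is the vector x(t) of one pixel; applying an unmixing matrix W and reshaping the rows of W X back to H × W yields the candidate source patterns.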
3 The Proposed Solutions: ICA, PCA, and Whitening
When no additional assumption is made, problem (1) is clearly underdetermined, since any nonsingular choice for A can give an estimate of s(t) that accounts for the evidence x(t). Even if no specific information is available, statistical
assumptions can often be made on the sources. In particular, it can be assumed that the sources are mutually independent. If this assumption is justified, both A and s can be estimated from x. As mentioned in the introduction, this is the ICA approach [6]. If the prior distribution for each source is known, independence is equivalent to assuming a factorized form for the joint prior distribution of s:

P(s) = Π_{j=1..M} P_j(s_j)    (2)
The separation problem can be formulated as the maximization of eq. (2), subject to the constraint x = As. This is equivalent to the search for a matrix W such that, when applied to the data x(t), it produces vectors W x(t) that are maximally independent and whose distributions are given by the P_j. By taking the logarithm of eq. (2), the problem solved by ICA algorithms is then:

W = arg max_W Σ_{t=1..T} Σ_{j=1..M} log P_j([W x(t)]_j)    (3)
The matrix W so obtained is an estimate of the inverse of the mixing matrix A, up to arbitrary scale factors and permutations of the columns of A. Hence, each estimated source is one of the original source vectors up to a scale factor. Besides independence, a necessary extra condition on the sources to make separation possible is that all of them, except at most one, must be non-Gaussian. To enforce non-Gaussianity, generic super-Gaussian or sub-Gaussian distributions can be used as priors for the sources. These have proven to give very good estimates for the mixing matrix and for the sources as well, regardless of the true source distributions, which, on the other hand, are usually unknown [2]. Although we have already obtained some promising results by this approach [10], there is no apparent physical reason why our original sources should be mutually independent, so, even if the data model (1) were correct, the ICA principle is not assured to be able to separate the different classes. However, it is intuitively clear that one can try to maximize the information content in each component of the data vector by decorrelating the observed image channels. To avoid cumbersome notation, and without loss of generality, let us assume to have zero-mean data vectors. We thus seek a linear transformation y = Wx such that ⟨y_i y_j⟩ = 0 for i ≠ j, where W is generally an M × N matrix and the notation ⟨·⟩ means expectation. In other words, the components of the transformed data vector y are orthogonal. It is clear that this operation is not unique, since, given an orthonormal basis of a subspace, any rigid rotation of it still yields an orthonormal basis of the same subspace. It is well known that linear data processing can help to restore color text images, although the linear model is not fully justified. In [7], the authors compare the effect of many fixed linear color transformations on the performance of a recursive segmentation algorithm. They argue that the linear transformation that obtains maximum-variance components is the most effective. They thus derive a fixed transformation that, for a large class of images, approximates the Karhunen-Loève transformation, which
is known to give orthogonal output vectors, one of which has maximum variance. This approach is also called principal component analysis (PCA), and one of its purposes is to find the most useful among a number of variables [3]. Our data covariance matrix is the N × N matrix:

Σ_x = ⟨x xᵀ⟩    (4)

Since the data are normally correlated, matrix Σ_x will be nondiagonal. The covariance matrix of vector y is:

Σ_y = ⟨y yᵀ⟩ = W Σ_x Wᵀ    (5)

To obtain orthogonal y, Σ_y should be diagonal. Let us perform the eigenvalue decomposition of matrix Σ_x, and call E the matrix of the eigenvectors of Σ_x and D the diagonal matrix of its eigenvalues, in decreasing order. Now, it is easy to verify that all of the following choices for W yield a diagonal Σ_y:

W₁ = Eᵀ    (6)
W₂ = D^(−1/2) Eᵀ    (7)
W₃ = E D^(−1/2) Eᵀ    (8)

Matrix W₁ produces a set of vectors that are orthogonal to each other and whose Euclidean norms are equal to the eigenvalues of the data covariance matrix. This is what PCA does [3]. By using matrix W₂ we obtain a set of orthogonal vectors of unit norms, i.e. orthogonal vectors located on a spherical surface (whitening, or Mahalanobis transform). This property still holds true if any whitening matrix is multiplied from the left by an orthogonal matrix. In particular, if we use matrix W₃ defined in (8), we have a whitening matrix with the further property of being symmetric. In [3], it is observed that application of matrix W₃ is equivalent to ICA when matrix A is symmetric. In general, ICA applies a further rotation to the output vectors, based on higher-order statistics.
4 Experimental Results and Concluding Remarks
Our experimental work has consisted in applying the above matrices to typical images of ancient documents, with the aim of emphasizing the hidden document features in the whitened vectors. For each test image, the results are of course different for different whitening matrices. However, it is interesting to note that the symmetric whitening matrix often performs better than ICA, and, in some cases, it can also achieve a separation of the different components, which is the final aim of BSS. Here, we show some examples from our extensive experimentation. The first example (Figure 1) describes the processing of an ancient manuscript which presents three overlapped patterns: a main text, an underwriting barely visible in the original image, and a noisy background with
Fig. 1. Full separation with symmetric orthogonalization: (a) grayscale representation of the color scan of an ancient manuscript containing a partially hidden text; (b) first symmetric orthogonalization output from the RGB components of the color image; (c) second symmetric orthogonalization output from the same data set.
significant paper folds. We compared the results of the FastICA algorithm [5][10], PCA, and symmetric whitening, all applied to the RGB channels, and found that full separation and enhancement of the three classes is obtained by the symmetric orthogonalization only. The ICA failure might depend, in this case, on the inaccuracy of the data model and/or the lack of mutual independence of the classes. In Figure 2, we report another example, where a paper watermark pattern is detected and extracted. In this case, we assume the document to be constituted of only two classes: the foreground pattern, with drawings and text, and the background pattern with the watermark, so that only two views are needed. We used two infrared acquisitions, the first taken under front illumination, the second taken with illumination from the back. In this case a good extraction is achieved by all three methods proposed. However, the best result is obtained with FastICA. Finally, Figure 3 shows a last example of extraction of a faint underlying pattern, using the RGB components. In this case, all three proposed methods performed similarly.
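As an indication of how such a comparison can be reproduced, the fragment below applies an off-the-shelf FastICA implementation (scikit-learn's FastICA class, used here purely as an example; the implementation actually employed for the experiments is not specified) to the RGB channels of an image.

import numpy as np
from sklearn.decomposition import FastICA

def ica_views(rgb):
    """Estimate M = 3 source patterns from the RGB channels of an
    H x W x 3 image with FastICA (illustrative sketch only)."""
    h, w, _ = rgb.shape
    X = rgb.reshape(h * w, 3).astype(float)     # one row per pixel
    ica = FastICA(n_components=3, random_state=0)
    S = ica.fit_transform(X)                    # estimated sources, per pixel
    return [S[:, k].reshape(h, w) for k in range(3)]

Each returned array can be displayed as a greyscale image and compared with the outputs of the PCA and whitening transforms of the previous section.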
Fig. 2. Watermark detection: (a) infrared front view; (b) back illumination infrared view; (c) one FastICA output.
These experiments confirmed our initial intuition about the validity of BSS techniques for enhancing and separating the various features that appear overlapped in many ancient documents. No conclusions can be drawn, instead, about the superiority of one method over the others for all documents. We can only say that, when the main goal is to enhance partially hidden features, at least one of the three methods proposed always succeeded in reaching this goal in all our experiments. The advantages of these techniques are that they are quite simple and fast, and do not require reverse-side scans or image registration. Our research plans for the near future concern the development of more accurate numerical models for the phenomenon of pattern overlapping in documents. Acknowledgements. We would like to thank the Isyreadet partners for providing the original document images. Composition of the Isyreadet consortium: TEA SAS (Catanzaro, Italy), Art Innovation (Oldenzaal, The Netherlands), Art Conservation (Vlaardingen, The Netherlands), Transmedia (Swansea, UK),
Fig. 3. Detection of an underlying pattern: (a) grayscale version of the original color document; (b) underlying pattern detected by symmetric orthogonalization.
Atelier Quillet (Loix, France), Acciss Bretagne (Plouzane, France), ENST (Brest, France), CNR-ISTI (Pisa, Italy), CNR-IPCF (Pisa, Italy).
References
1. http://www.art-innovation.nl/
2. Bell, A.J., Sejnowski, T.J.: Neural Computation (1995) 7:1129–1159
3. Cichocki, A., Amari, S.-I.: Adaptive Blind Signal and Image Processing (2002) Wiley, New York.
4. Easton, R.L.: http://www.cis.rit.edu/people/faculty/easton/k-12/index.htm
5. Hyvärinen, A., Oja, E.: Neural Networks (2000) 13:411–430.
6. Hyvärinen, A., Karhunen, J., Oja, E.: Independent Component Analysis (2001) John Wiley, New York.
7. Ohta, Y., Kanade, T., Sakai, T.: Computer Graphics, Vision, and Image Processing (1980) 13:222–241.
8. Sharma, G.: IEEE Trans. Image Processing (2001) 10:736–754.
9. Swift, R.: http://www.cis.rit.edu/research/thesis/bs/2001/swift/thesis.html.
10. Tonazzini, A., Bedini, L., Salerno, E.: Int. J. Document Analysis and Recognition (2004) in press.
Efficient Removal of Noisy Borders from Monochromatic Documents Bruno Tenório Ávila and Rafael Dueire Lins Universidade Federal de Pernambuco, Recife - PE, Brazil {bta, rdl}@ee.ufpe.br
Abstract. This paper presents an algorithm based on Flood Fill, Component Labelling, and Region Adjacency Graphs for removing the noisy borders introduced in monochromatic images of documents by the digitalization process using automatically fed scanners. The new algorithm was tested on 20,000 images and provided better image quality and time-space performance than its predecessors, including widely used commercial tools. Keywords: Document Image Analysis, Border removal, Binary Images.
1 Introduction
The evolution from paper to electronic documents is happening at a very fast rate. However, the problem of large amounts of paper documents inherited from a not very distant past needs to be addressed. Digitalization bridges past and present technologies, organizing, indexing, storing, retrieving directly or making accessible through networks, and keeping the contents of documents for future generations. Monochromatic images are adopted by most organizations, as their paper documents tend to be typed or hand-written and have no iconographic or artistic value. Automatically fed scanners tend to be of widespread use for the digitalization of this kind of document. Scanner resolution is set as low as possible to save storage space. Depending on a number of factors, such as the size of the document, its state of conservation and physical integrity, the presence or absence of dust on the document and scanner parts, etc., very frequently the image generated is framed either by a solid or a striped black border. This undesirable artefact, also known as marginal noise 6, not only degrades the quality of the resulting image for CRT visualization, but also consumes space for storage and large amounts of toner for printing. Removing such a frame manually is not practical due to the need for specialized users and the time consumed in the operation. Several production-line scanner manufacturers have developed software for removing those noisy borders. However, the software tools tested are too greedy and tend to remove essential parts of documents. Although this is a recent problem, several researchers have already addressed it. Le 9 proposes a method for border removal that consists in splitting the document into a grid of several image blocks. Through some statistical analysis, each of the blocks is
classified as text, non-text (pictures, drawings, etc.), paper background or border. Blocks classified as border are removed. Le's algorithm is not suitable for the kind of documents addressed herein, as document information may be merged with the noisy border. In this case, it is most likely that information is lost by removing the block. Kanungo et al 8 propose a document degradation model to analyze and simulate the distortion that occurs while scanning thick and bound documents. There is some difficulty in tuning the parameters of the proposed algorithm. Baird 3 describes a defect model with a number of document image parameters, including size, resolution, horizontal and vertical scaling factors, translational offsets, jitter, defocusing and sensitivity. Fan et al 6 propose a scheme to remove the marginal noise of scanned documents by reducing the resolution of the document image. This process makes the textual part of the document disappear, leaving only blocks to be classified either as images or border by a threshold filter. The block classification is used to segment the original image, removing the noise border. A recent algorithm 1 modifies the flood fill scheme 4 to remove noise borders in monochromatic documents. Statistical analysis of document features is used to find parameters to stop the flooding process. In this paper, a new and efficient algorithm for removing noisy borders from monochromatic documents is presented. This algorithm uses Component Labelling and Region Adjacency Graphs to classify image regions and filter out the marginal noise. A batch of over 20,000 images was used as a benchmark, yielding good quality images with excellent time-space performance. The new algorithm has been shown to produce better quality images than the commercial software libraries available.
2 Document Features
The images targeted in this study are obtained by the digitalization of governmental (bureaucratic) documents. They range from typeset letters and hand filled-in forms to handwritten documents. Most documents make use of translucent paper, in such a way that back-to-front interference was not observed 10, 12. Documents range in size from A5 to Legal, with a predominance of sizes around A4. They exhibit no iconographic value; figures are restricted to printed seals. Some of them may include a 3 cm x 4 cm black-and-white (most usual) or colour photograph glued to the paper sheet. Frequently, these documents have punches made on the left-hand side margin for filing them, but sometimes punches were made on the document print. Many documents also have smaller holes made by staples. The state of conservation of documents also varies widely. Very frequently the filing punches on the left margin are torn off. These damages make noisy borders irregular in shape, thus increasing the computational difficulty of their automatic removal. Digitalization was performed by an automatically fed production-line scanner, reference Scanner DS 1500 16, manufactured by Kodak Corporation, using as set-up
automatic brightness/contrast. Documents may not be aligned in the scanner feeder tray. Different mechanical effort in the feeder device may also cause document rotation. The digitalization resolution adopted was 200 dpi. The resulting image is black-and-white. The noisy border has the following features: 1) it frames the document image; 2) it is either solid black (as shown in Figure 1), or 3) it presents a striped pattern in black with salt-and-pepper noise 7 (e.g. Figure 2). However, for the reasons explained above, two complications may arise: 1) part of the paper document may be torn off, yielding an image with an irregularly shaped noise frame, such as the one presented in Figure 3; 2) part of the document information may be connected to the noisy border frame, as presented in Figure 4.
Fig. 1. Solid black noisy frame
Fig. 2. Pattern with stripes
Fig. 3. Irregular shape noise border
Fig. 4. Information linked to noise border
3 The New Algorithm
The proposed algorithm encompasses five phases: 1) Flooding; 2) Segmentation; 3) Component labelling; 4) Region adjacency graph generation; and 5) Noise border removal. In what follows, each of these phases is explained. During flooding the algorithm moves outside-in from the noisy surrounding border of the image towards the document. A segment is formed by a horizontal run of connected black pixels. Each segment is characterized by three parameters: the horizontal coordinates of its leftmost and rightmost pixels and its vertical coordinate. All border segments are stored in a list (segment_list). This phase of the border removal scheme works as a "flood-fill" algorithm 4. Segmentation splits the area scanned by the flood-fill step into smaller regions. Document information merged with the noise border needs to be segmented out. The segmentation criterion is that the segment width or height of textual information should be narrower than pre-defined parameters (Figure 6), whose values are statistically calculated from the document batch. In the next step of the algorithm, each entry in segment_list (horizontal segments) is swept looking for segments narrower than max_hsegment as a possible part of text. At
this point of the segmentation process one also looks for vertical segments of height less than or equal to max_vsegment, which are classified as vertical entries.
Fig. 5. Arrow 1 points at a vertical_entry. Arrow 2 points at a horizontal_entry
Fig. 6. Pixels in red are classified as horizontal_entry
At this point the algorithm "merges" pixels into blocks by using 4x4 component labelling 13. The classical algorithm is adapted to provide three different types of blocks (border, horizontal and vertical entries). At the end of this step each pixel in segment_list gets a unique label. A list of blocks is generated with their dimensions and classification. This phase is responsible for a great speed-up in relation to 1. Path compression 5 was used to prune paths, improving the performance of the labelling. The next step in the algorithm is to generate a graph stating how blocks of pixels relate to each other in terms of vicinity 13. This allows blocks to be merged by simplifying the graph, since the adjacency relationship has already been established. Each block gets a list of associated neighbours; duplicated neighbours are removed from the graph, avoiding blocks being visited several times and thus increasing the efficiency of the algorithm. At this point of the algorithm all blocks in the image have already been visited and have been classified either as border or image. Several criteria are used to decide whether a block is removed or not, depending on its classification, dimensions, horizontal and vertical projections, and position in the image. One must observe that at no point does this algorithm scan the whole image; it is restricted to the initial border area marked outside-in by the flood fill in its first step. An inside-out analysis of the image allows it to be cropped, removing part of the noise border. Alignment analysis (and skew correction, if needed) of the image should be made before performing the image crop. This saves storage space. The remaining pixels in blocks classified as border are made white. Image cropping was not implemented in this version of the algorithm.
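The following Python sketch illustrates the overall flow of the five phases on a binary image. Routines from scipy.ndimage stand in for the authors' segment lists, adapted component labelling and region adjacency graph, and the classification rule of the last step is a deliberately simplified example, so the fragment should be read as a structural outline rather than a re-implementation; the parameter values are arbitrary.

import numpy as np
from scipy import ndimage

H_RUN = np.array([[0, 0, 0], [1, 1, 1], [0, 0, 0]])   # horizontal connectivity
V_RUN = H_RUN.T                                        # vertical connectivity

def run_lengths(mask, structure):
    """Length, per pixel, of the horizontal or vertical run of True pixels
    that the pixel lies on (zero outside the mask)."""
    runs, _ = ndimage.label(mask, structure=structure)
    sizes = np.bincount(runs.ravel())
    return sizes[runs] * mask

def remove_border(black, max_hseg=40, max_vseg=40):
    """Structural sketch of the five phases (black: 2D boolean array, True = ink).
    max_hseg / max_vseg play the role of max_hsegment / max_vsegment."""
    # 1) Flooding: ink pixels connected (4-neighbourhood) to the image frame.
    labels, _ = ndimage.label(black)
    frame = np.unique(np.r_[labels[0, :], labels[-1, :], labels[:, 0], labels[:, -1]])
    flooded = np.isin(labels, frame[frame != 0])

    # 2) Segmentation: pixels of the flooded area lying on short horizontal and
    #    vertical runs are treated as possible text entries and set aside.
    entries = (flooded
               & (run_lengths(flooded, H_RUN) <= max_hseg)
               & (run_lengths(flooded, V_RUN) <= max_vseg))
    candidate = flooded & ~entries

    # 3)-4) Component labelling of the remaining candidate border blocks (the
    #    region adjacency graph used to merge neighbouring blocks is omitted).
    blocks, _ = ndimage.label(candidate)

    # 5) Noise border removal: blocks touching the image frame are classified
    #    as border and whitened; the entries (possible text) are preserved.
    edge = np.unique(np.r_[blocks[0, :], blocks[-1, :], blocks[:, 0], blocks[:, -1]])
    border = np.isin(blocks, edge[edge != 0])
    return black & ~border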
4 Image Quality
This section focuses on the quality of the images obtained by using the new algorithm, its predecessor 1, and the commercial tools Scanfix 18, Leadtools 17, BlackIce 14, Skyline Tools 19, and ClearImage 15 for removing the noisy border of document images. These algorithms seem to be proprietary and unpublished. No pre- or post-processing was performed on any image, i.e. the images were submitted to the algorithms and tools exactly as they came out of the scanning process, compressed in
TIFF (which is suitable for this kind of image 11) and stored. Salt-and-pepper filtering prior to border removal would possibly reduce overall processing time. Regular solid black noisy frames pose little difficulty for removal. The real difficulty arises in the case of irregular borders caused by torn-off margins in the paper document. Even more difficult is the case in which the border reaches document information. Figures 3 and 4 exemplify these two cases. These images are used to assess the quality of the filtering performed by the algorithms and tools. The results of filtering the test images by using the newly introduced algorithm are shown in Figures 9 and 10, respectively. Both algorithms were fed with the same segmentation parameters and produced similar images.
Fig. 7. Use of algorithm [2]
Fig. 8. Use of algorithm [2]
Fig. 9. New algorithm
Fig. 10. New algorithm
The tests performed with the Scanfix toolkit 18 showed that it works efficiently in the case of noisy borders that do not merge with document information. However, as shown in Figures 11 and 12, whenever the noisy border enters into document information the tool is not able to remove it, yielding an image of poor quality. On the other hand, no information loss was observed.
Fig. 11. Noisy border removed by Scanfix
Fig. 12. Use of Scanfix
Fig. 13. Noisy border re- Fig. 14. Use of Leadtools moved by Leadtools
The Leadtools toolkit 17 needs a number of parameters to be set for image filtering. (Parameters adopted: Border Percent=40; White Noise Length=40; Variance=40). Leadtools yielded poor quality images where part of the document information was removed (Figure 13) and most of the noisy pattern remained in the filtered image (Figure 14). The BlackIce toolkit 14 takes no user input parameters. Although its border removal algorithm is not described, the filtered images allow one to infer that it implements
some sort of naïve flood-fill algorithm. Images suffer significant loss of information whenever the noisy border is merged with document text, as can be seen in Figures 15 and 16.
Fig. 15. Noisy border removed by BlackIce
Fig. 16. Use of BlackIce
Fig. 17. Noisy border removed by Skyline
Fig. 18. Use of Skyline
The Skyline toolkit 19 takes as user input a parameter n, called Threshold, which is used to remove an n-pixel border from the image. It works "blindly", clipping either noisy borders or information. As one may expect, such a tool yields images of disastrous quality, in general. However, in the two test images of Figures 3 and 4 no information is lost, as can be seen in Figures 17 and 18. The only tool tested that produced quality images was ClearImage 15. However, a careful look at Figure 22 shows that ClearImage removed part of the information of the letter "a" during filtering, while the algorithm presented herein keeps that information untouched.
Fig. 19. New algorithm
Fig. 20. New algorithm
Fig. 21. Filtering by ClearImage
Fig. 22. Filtering by ClearImage
5 Time-and-Space Figures
This section analyses the time-and-space performance of the border removal algorithms. The new and old 1 algorithms were implemented in C. All algorithms were executed on an Intel Pentium IV, 2.4 GHz clock, 512 MB RAM and an IDE hard disk. The set of images used to test the algorithms was split into four groups: CD1, CD2, CD3 and CD4. All images are black-and-white, were digitized at a resolution of 200 dpi, and were stored in the TIFF file format with CCITT G4 compression. Images in CD1 may be considered "clean", while the others may be considered "very noisy" (the noise border frame covers over 20% of the image area). The quality of the resulting images of the algorithm proposed herein is similar to the one
described in reference 1. For both algorithms, in over 95% of the images all border noise was suitably removed. In the case of the remaining less than 5% of the documents, although some of the noisy border still remained in the filtered image, no document information was removed, keeping their contents intact. Images filtered with ClearImage come close to them in quality, but are worse. Border removal not only yields image files of smaller size, but also allows compression algorithms to act more effectively. Table 1 shows the results obtained.
CD2, CD3, and CD4 consumed more processing time than CD1, due to the strong presence of border noise in the images. However, for these images the rate of size compression presented the best results, reaching 29% on average. In both the old and the new algorithm the elapsed time is proportional to the quantity of noise in the image, but due to component labelling the constant of proportionality of the new algorithm is much smaller than that of the old one. The new algorithm is almost twice as fast as the ClearImage toolkit and three times faster than the algorithm presented in 1.
6 Conclusions
This paper presents an efficient algorithm for removing the noisy black frames inserted in monochromatic images of documents by automatically fed production-line scanners. It works well even on images of torn-off documents or in the presence of irregular white stripes in the noise border frame. The new algorithm yielded far better quality images than the widely used commercial tools Scanfix, Leadtools, BlackIce, and Skyline. ClearImage is the only commercial tool to offer images of quality close to, but worse than, the ones produced by the algorithm presented herein. The new algorithm was tested on over 20,000 images of documents. In at least 95% of them all border noise was suitably removed. In the remaining 5%, although some of the noisy border still remained in the filtered image, no document information was removed, keeping the contents intact. The algorithm introduced here yielded the best time performance of all tools and algorithms tested. Details of the new algorithm presented in this paper and of the border removal algorithm in reference 1, including their description in pseudocode, will appear in reference 2. Statistical methods on how to automatically estimate the tuning of parameters are also provided there.
References
1. B.T. Ávila and R.D. Lins, A New Algorithm for Removing Noisy Borders from Monochromatic Documents, Proc. of ACM-SAC'2004, pp 1219-1225, Cyprus, ACM Press, March 2004.
2. B.T. Ávila and R.D. Lins, Removing Noise Borders from Monochromatic Scanned Documents, in preparation.
3. H.S. Baird, Document image defect models and their uses, Proc. 2nd Int. Conf. on Document Analysis and Recognition, Japan, IEEE Comp. Soc., pp 62-67, 1993.
4. M. Berger, Computer Graphics with Pascal. Addison-Wesley, 1986.
5. T.H. Cormen, C.E. Leiserson, R.L. Rivest, C. Stein. Introduction to Algorithms, MIT Press, Second Edition, 2001.
6. K.C. Fan, Y.K. Wang, T.R. Lay, Marginal noise removal of document images, Patt. Recog. 35, 2593-2611, 2002.
7. L. O'Gorman and R. Kasturi, Document Image Analysis, IEEE Computer Society Executive Briefing, 1997.
8. T. Kanungo, R.M. Haralick, I. Phillips, Global and local document degradation models, Proc. 2nd Int. Conf. Doc. Analysis and Recognition, pp 730-734, 1993.
9. D.X. Le, Automated borders detection and adaptive segmentation for binary document images. National Library of Medicine. http://archive.nlm.nih.gov/pubs/le/twocols/twocols.php.
10. R.D. Lins, M.S. Guimarães Neto, L.R. França Neto, and L.G. Rosa. An Environment for Processing Images of Historical Documents. Microprocessing & Microprogramming, pp 111-121, North-Holland, January 1995.
11. R.D. Lins and D.S.A. Machado, A Comparative Study of File Formats for Image Storage and Transmission, Journal of Electronic Imaging, vol 13(1), pp 175-183, Jan. 2004.
12. C.A.B. Mello and R.D. Lins. Image Segmentation of Historical Documents, Visual 2000, Aug. 2000, Mexico.
13. L.G. Shapiro and G.C. Stockman, Computer Vision, March 2000. http://www.cse.msu.edu/~stockman/Book/book.html.
14. BlackIce Document Imaging SDK 10. BlackIce Software Inc. http://www.blackice.com/.
15. ClearImage 5. Inlite Research Inc. http://www.inliteresearch.com.
16. Kodak Digital Science Scanner 1500. http://www.kodak.com/global/en/business/docimaging/1500002/
17. Leadtools 13. Leadtools Inc. http://www.leadtools.com.
18. ScanFix Bitonal Image Optimizer 4.21. TMS Sequoia. http://www.tmsinc.com.
19. Skyline Tools Corporate Suite 7. Skyline Tools Imaging. http://www.skylinetools.com.
Robust Dichromatic Colour Constancy Gerald Schaefer School of Computing and Technology The Nottingham Trent University Nottingham, United Kingdom
[email protected]
Abstract. A novel colour constancy algorithm that utilises both physical and statistical knowledge is introduced. A physics-based model of image formation is combined with a statistics-based constraint on the possible scene illumination. Based on the dichromatic reflection model, the intersection of the colour signal planes of two objects will yield the scene illuminant. However, due to noise and insufficient segmentation this approach tends not to work outside the lab. By removing those intersections that are likely to produce unstable illuminant estimates and applying an illumination constraint in the form of a set of feasible reference lights, the colour constancy algorithm presented clearly outperforms the conventional approach and provides excellent results on a benchmark set of real images. Keywords: colour, colour constancy, dichromatic reflection model, specularities
1 Introduction
The sensor responses of a device such as a digital camera depend both on the surfaces in a scene and on the prevailing illumination conditions: an object viewed under two different illuminants will yield two different sets of sensor responses. For humans, however, the perceived colour of an object is more or less independent of the illuminant; a white paper appears white both outdoors under bluish daylight and indoors under yellow tungsten light, though the responses of the eyes' colour receptors, the long-, medium-, and short-wave sensitive cones, will be quite different in the two cases. Researchers in computer vision have long sought algorithms to make colour cameras equally colour constant. Colour constancy algorithms can be divided into two main categories: physics-based and statistics-based. Statistical methods try to correlate the distribution of colours in the scene with statistical knowledge of common lights and surfaces, while physics-based approaches are based on an understanding of how physical processes such as highlights and interreflections manifest themselves in images. Perhaps the best studied physics-based colour constancy algorithms, and the ones which show the most (though still limited) functionality, are based on the dichromatic reflectance model. Under the dichromatic model, the light reflected from a surface comprises two physically different types of reflection: interface and body reflection. The body part models conventional matte surfaces.
Here light enters the surface and is scattered and absorbed by the internal pigments; some of the scattered light is then re-emitted randomly, thus giving the body reflection a Lambertian character. Interface reflection, which models highlights, and for which the light does not enter the surface but is reflected in a mirror-like way, usually has the same spectral power distribution as the illuminant. Because light is additive, the colour signals from inhomogeneous dielectrics will then fall on what is called a dichromatic plane, spanned by the reflectance vectors of the body and the interface parts respectively. As the specular reflectance essentially represents the illuminant reflectance, the illuminant vector is contained in the dichromatic plane of an object. The same is obviously true for a second object. Thus, a simple method for estimating the illumination colour is to find the intersection of the two dichromatic planes. Indeed, this algorithm is well known and has been proposed by several authors [5,8,9,7]. Dichromatic colour constancy algorithms are hence theoretically able to provide an illuminant estimate given two surfaces in a scene. In practice, however, these algorithms only work on 'toy' images, for example images containing highly saturated plastic objects or similar scenes viewed under controlled conditions, but not on natural images. The reasons for this are twofold. First, an image must be segmented into different dichromatic regions, and general segmentation has proven to be very hard to solve. The second and more serious problem is that the dichromatic computation is not robust. When the interface and body reflectance RGBs are close together (the case for most surfaces) the dichromatic plane can only be approximately estimated. Moreover, this uncertainty is magnified when two planes are intersected. This problem is particularly serious when two surfaces have similar colours. In this case the dichromatic planes have similar orientations and the recovery error for illuminant estimation is often very high. In this paper a robust approach to dichromatic colour constancy is presented. Intersections between pairs of dichromatic planes are calculated. Only those intersections that are likely to lead to good illuminant estimates are taken into account and used to derive likelihoods for the image to have been captured under each of a set of reference lights (thus incorporating an illumination constraint). The maximum likelihood candidate is then used to identify an estimate of the scene illumination. The rest of the paper is organised as follows. Section 2 provides a brief review of colour image formation and the dichromatic reflection model. Section 3 presents the novel robust dichromatic algorithm in detail. Section 4 gives experimental results while Section 5 concludes the paper.
2 Background
2.1 Image Formation
Robust Dichromatic Colour Constancy
259
where is wavelength, is a 3-vector of sensor responses (RGB pixel values), C is the colour signal (the light reflected from an object), and R is the 3-vector of sensitivity functions of the device. Integration is performed over the visible spectrum The colour signal itself depends on both the surface reflectance and the spectral power distribution of the illumination. For pure Lambertian (matte) surfaces is proportional to the product and its magnitude depends on the angle(s) between the surface normal and the light direction(s). The brightness of Lambertian surfaces is independent of the viewing direction.
2.2 Dichromatic Reflection Model
In the real world, however, most objects are non-Lambertian and have some glossy or highlight component. The combination of matte reflectance together with a geometry-dependent highlight component is modelled by the dichromatic reflection model [6,8] for inhomogeneous dielectrics. Inhomogeneous dielectrics are composed of more than one material with different refractive indices; usually there exist a vehicle dielectric material and embedded pigment particles. Examples of inhomogeneous dielectrics include paints, plastics, and paper. The dichromatic reflection model states that the colour signal is composed of two additive components, one associated with the interface reflectance and the other describing the body (or matte) reflectance [6]. Both of these components can further be decomposed into a term describing the spectral power distribution of the reflectance and a scale factor depending on the geometry. This can be expressed as

C(λ) = m_i C_i(λ) + m_b C_b(λ),    (2)

where C_i(λ) and C_b(λ) are the spectral power distributions of the interface and the body reflectance respectively, and m_i and m_b are the corresponding weight factors depending on the geometry, which includes the incident angle of the light, the viewing angle and the phase angle. Equation (2) shows that the colour signal can be expressed as the weighted sum of the two reflectance components. Thus the colour signals for an object are restricted to a plane. Making the roles of light and surface explicit, Equation (2) can be further expanded to

C(λ) = m_i S_i(λ) E(λ) + m_b S_b(λ) E(λ),    (3)

where S_i(λ) and S_b(λ) are the interface and body reflectances and E(λ) is the spectral power distribution of the illuminant. Since for many materials the index of refraction does not change significantly over the visible spectrum, the interface reflectance can be assumed to be constant and Equation (3) becomes

C(λ) = m̃_i E(λ) + m_b S_b(λ) E(λ),    (4)

where m̃_i now describes both the geometry-dependent weighting factor and the constant reflectance of the interface term.
By substituting Equation (4) into Equation (1) we get the device’s responses for dichromatic reflectances:

ρ_k = m̃_i ∫ E(λ) R_k(λ) dλ + m_b ∫ S_b(λ) E(λ) R_k(λ) dλ,  k ∈ {R, G, B},    (5)

which we rewrite as

(R, G, B) = m̃_i (R_i, G_i, B_i) + m_b (R_b, G_b, B_b),    (6)

where R, G, and B are the red, green, and blue pixel value outputs of the digital camera and the subscripts i and b denote the interface and body parts. Because the RGB of the interface reflectance is equal to the RGB of the illuminant E, we rewrite (6) making this observation explicit:

(R, G, B) = m̃_i (R_E, G_E, B_E) + m_b (R_b, G_b, B_b).    (7)
2.3 Dichromatic Colour Constancy
Equation (7) shows that the RGBs for a surface lie on a two-dimensional plane, one component of which is the RGB of the illuminant. If we consider two objects within the same scene (and assume that the illumination is constant across the scene) then we end up with two RGB planes. Both planes however contain the same illuminant RGB. This implies that their intersection must be the illuminant itself. Indeed, this is the essence of dichromatic colour constancy and has been proposed by several authors [5,8,9,7]. Though theoretically sound, dichromatic colour constancy algorithms perform well only under idealised conditions. For real images the estimate of the illuminant turns out not to be that accurate. The reason for this is that in the presence of a small amount of image noise the intersection of two dichromatic planes can change quite drastically, depending on the orientations of the dichromatic planes. Dichromatic colour constancy tends to work well for highly saturated surfaces taken under laboratory conditions but much less well for real images.
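As a concrete illustration of this classical two-surface scheme, the following sketch (Python with NumPy; the function names and the orientation convention are illustrative choices, not taken from the paper) fits a dichromatic plane to each surface by SVD and intersects the two planes via the cross product of their normals.

```python
import numpy as np

def dichromatic_plane_normal(rgbs):
    """Fit a dichromatic plane (through the origin) to an (N, 3) array of RGB
    samples from one surface; the normal is the right-singular vector that
    corresponds to the smallest singular value."""
    _, _, vt = np.linalg.svd(np.asarray(rgbs, dtype=float), full_matrices=False)
    return vt[2]

def classic_dichromatic_estimate(rgbs_a, rgbs_b):
    """Two-surface dichromatic estimate: the illuminant direction is the
    intersection line of the two dichromatic planes, i.e. the cross product
    of their normals."""
    e = np.cross(dichromatic_plane_normal(rgbs_a),
                 dichromatic_plane_normal(rgbs_b))
    e = e / np.linalg.norm(e)
    return e if e.sum() >= 0 else -e  # orient towards the positive RGB octant
```

When the two sets of RGBs have similar colours the two normals are nearly parallel and the cross product becomes numerically unstable, which is exactly the lack of robustness discussed above.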
3 Robust Dichromatic Colour Constancy
The colour constancy algorithm presented in this paper is again based on the dichromatic reflection model. Rather than finding the best overall intersection of all planes as in [5,9,7], the result of each pairwise intersection of dichromatic planes is utilised. Proceeding this way also makes it possible to discard those intersections that would lead to inaccurate illuminant estimates, as is explained in the following.
3.1 Robust Dichromatic Estimates
Eliminating unlikely or inaccurate illuminant estimates proceeds in several stages. Each dichromatic plane is found through the application of SVD to the colour signal matrix of a surface. Since, due to noise or other reasons such as incorrect segmentation, illuminant estimates may fall outside the gamut of common light sources, in a first step those dichromatic planes are discarded that do not intersect the convex hull of common lights [2]. Alternatively, relaxing this scheme slightly, those planes are discarded that do not pass close enough to the hull, where ‘close’ is defined as the minimal angle between the plane and any vector of the hull; this angle is compared to a (small) threshold. Applying this constraint leaves a reduced set of dichromatic planes from which pairwise intersections can be calculated.

However, due to noise, not all intersections will provide good estimates. As has been shown in [3], the intersection of dichromatic planes with similar orientations is far less stable than an estimate based on planes that are close to orthogonal to each other. Estimates obtained from planes with similar orientations will therefore provide little accuracy and should not be considered. Hence, a second threshold is set which defines the minimal angle between two dichromatic planes for the resulting intersection to be considered in later stages of the algorithm. Since an intersection of two dichromatic planes might still produce a physically impossible or unlikely estimate, a further constraint is applied which ensures that the intersection falls inside the convex hull of common lights, or close to it: a threshold is defined for the distance between the intersection and the hull, where distance is again defined as the minimal angle between the intersection vector and any vector of the hull. Only those intersections that satisfy all of these constraints are retained for illuminant estimation.

It is apparent that the last constraint implicitly encodes the per-plane gamut conditions, too. However, those conditions are expressed explicitly because checking them first makes the algorithm computationally less expensive.
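A sketch of this filtering stage follows; the gamut of common lights is represented simply by a list of RGB vectors of plausible illuminants, and the three angular thresholds are illustrative placeholders rather than the values used in the paper.

```python
import numpy as np

def angle_deg(u, v):
    """Angle in degrees between two 3-vectors."""
    c = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return np.degrees(np.arccos(np.clip(c, -1.0, 1.0)))

def min_angle_to_set(v, gamut):
    """Angular distance between a vector and the closest gamut vector."""
    return min(angle_deg(v, g) for g in gamut)

def plane_distance_to_set(normal, gamut):
    """Angle between a plane (given by its unit normal) and the closest gamut vector."""
    return min(abs(90.0 - angle_deg(normal, g)) for g in gamut)

def robust_intersections(normals, gamut, t_plane=3.0, t_pair=15.0, t_est=5.0):
    """Keep only intersections of plane pairs that (1) come from planes passing
    close to the gamut of common lights, (2) are sufficiently non-parallel,
    and (3) themselves lie close to the gamut."""
    planes = [n for n in normals if plane_distance_to_set(n, gamut) <= t_plane]
    estimates = []
    for i in range(len(planes)):
        for j in range(i + 1, len(planes)):
            a = angle_deg(planes[i], planes[j])
            if min(a, 180.0 - a) < t_pair:          # nearly parallel planes: unstable
                continue
            e = np.cross(planes[i], planes[j])
            e /= np.linalg.norm(e)
            if e.sum() < 0:
                e = -e
            if min_angle_to_set(e, gamut) <= t_est:  # physically plausible estimate
                estimates.append(e)
    return estimates
```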
3.2 Integration of Intersection Estimates and Selection of Solution
Following the procedure above produces a set of intersections of dichromatic planes, many of which should provide a good estimate for the colour constancy problem. A simple way of integrating these estimates would be to take the average vector of all intersections as the solution of the algorithm. Here, however, a different approach is taken. As has been shown in [3,2], the application of an illuminant constraint can vastly improve dichromatic colour constancy. While in [3] a linear constraint was introduced, [2] used convex and non-convex gamut constraints. For the algorithm presented here a different, discrete, illuminant constraint is employed in the form of a set of reference illuminants which are selected beforehand. Having selected a set of reference lights, the intersection estimates need to be combined to provide a single solution. A solution vector is built which describes the likelihood for the image to have been captured under each of the reference lights; its entries are calculated by incrementing the likelihood of each reference light, for every retained intersection, by the inverse of the angle between that intersection and the reference light.

Based on this vector a single solution has to be identified. A maximum likelihood selection is performed, i.e. the reference illuminant with the highest likelihood is returned.

However, more information than just the single illuminant can be provided. The likelihoods of the solution vector can be used to quantify the confidence in the solution or to produce error bars of the recovery. Rather than just a single estimate, a few most likely illuminants can be returned together with their respective likelihoods. Finally, the whole likelihood vector could be provided for further processing.
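The integration and selection steps can be sketched as follows; the reference light RGBs and the small epsilon guard against exact angular matches are illustrative assumptions.

```python
import numpy as np

def angle_deg(u, v):
    c = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return np.degrees(np.arccos(np.clip(c, -1.0, 1.0)))

def illuminant_likelihoods(estimates, reference_lights, eps=1e-3):
    """For every reference light, accumulate the inverse of the angle to each
    surviving intersection estimate."""
    likelihoods = np.zeros(len(reference_lights))
    for e in estimates:
        for k, ref in enumerate(reference_lights):
            likelihoods[k] += 1.0 / (angle_deg(e, ref) + eps)
    return likelihoods

def maximum_likelihood_light(estimates, reference_lights):
    """Return the most likely reference light together with the whole
    likelihood vector (which can also be used for confidence measures)."""
    likelihoods = illuminant_likelihoods(estimates, reference_lights)
    return reference_lights[int(np.argmax(likelihoods))], likelihoods
```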
4 Experimental Results
The dataset used for the experiments is provided by the Computer Vision Lab at Simon Fraser University (SFU) and is reported in [1]. These images were captured with the intent of providing a benchmark set for colour constancy algorithms. The ‘Mondrian’ subset was chosen for the experiments reported here. It consists of images of 22 objects taken under up to 11 different lights; in total there are 223 images. The Mondrian set shows some, but not very dominant, highlights; this image set was chosen to deliberately set a hard test. Obviously, for dichromatic colour constancy a scene must be segmented according to the objects present. Since segmentation is a hard problem, a fairly ‘naive’ approach to defining dichromatic surfaces was adopted here. Each image
was segmented into blocks (size 30 × 30 pixels), and each block was then tested for dichromatic properties. A block was defined to be dichromatic if, after performing SVD on the image data, the first two eigenvectors captured more than 98.5% of the variance. This ensures that one-dimensional image blocks such as background, ill-defined dichromatic blocks, and blocks where two objects coincide (and hence result in full-dimensional data) were discarded. After the segmentation step a set of dichromatic planes had been obtained which was used as input for the dichromatic colour constancy algorithms. For comparison, both Tominaga’s [7] and Lee’s [5] methods were implemented, as well as the constrained methods by Finlayson and Schaefer based on a locus [3] and on convex and alpha hull constraints [2]. For the constrained techniques, as well as for the algorithm presented in this paper, prior knowledge was assumed in the form that the illumination is known to be one of the 11 lights used to produce the SFU database. The angular thresholds of the robust algorithm were kept fixed for all experiments.
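The block-wise dichromatic test can be written compactly; the 30 × 30 block size and the 98.5% variance criterion follow the text, while the function names and the handling of degenerate blocks are illustrative choices.

```python
import numpy as np

def is_dichromatic_block(block_rgb, var_threshold=0.985):
    """A block is 'dichromatic' if a 2-D linear subspace explains more than
    var_threshold of the energy of its RGB samples (first two singular values)."""
    s = np.linalg.svd(block_rgb.reshape(-1, 3).astype(float), compute_uv=False)
    energy = s ** 2
    return energy[:2].sum() / energy.sum() >= var_threshold

def dichromatic_planes_from_image(image, block=30, var_threshold=0.985):
    """Tile the image into blocks and return one plane normal per block that
    passes the dichromatic test."""
    normals = []
    h, w, _ = image.shape
    for y in range(0, h - block + 1, block):
        for x in range(0, w - block + 1, block):
            patch = image[y:y + block, x:x + block].reshape(-1, 3).astype(float)
            if is_dichromatic_block(patch, var_threshold):
                _, _, vt = np.linalg.svd(patch, full_matrices=False)
                normals.append(vt[2])   # normal of the best-fit dichromatic plane
    return normals
```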
The results over the whole dataset are listed in Table 1; given are the median, mean and maximum angular error. Looking first at the results obtained by the classical dichromatic algorithms, a striking difference in performance between Tominaga’s approach and Lee’s implementation can be observed. Lee’s algorithm gives errors about twice as high as those returned by Tominaga’s method. As both algorithms are based on the same idea, that of intersecting dichromatic planes, this might appear a bit surprising. However, there are in fact two reasons which make Lee’s approach perform worse. First, Lee’s algorithm operates in chromaticity space while Tominaga’s operates in RGB space; the use of chromaticity space rather than the full RGB space can lead to a drop in performance. More important, however, is the fact that while Tominaga’s sub-space method essentially provides a least squares intersection of dichromatic planes, this is not the case for Lee’s dual space approach.
Returning to Table 1, it can be seen that the constrained techniques from [3] and [2] perform much better, with median errors between 2.19 and 4.42 degrees compared to the 7.26 degrees achieved by Tominaga’s algorithm. However, the robust algorithm introduced in this paper still clearly outperforms even those techniques. The median error over the whole dataset is a remarkable 1.06 degrees.
This indeed puts it in the league of the best statistical colour constancy algorithms, which is confirmed by the performance of the Colour by Correlation technique, one of the best colour constancy algorithms to date [4], whose results are also given in Table 1.¹ It can be seen that the robust dichromatic algorithm performs even slightly better than the Colour by Correlation approach.
5 Conclusions
A novel approach to colour constancy was introduced. A physics-based image formation model, the dichromatic reflection model, is combined with statistical knowledge of common lights in the form of an illumination constraint over a discrete set of reference lights. Intersections of dichromatic planes are calculated, with only those intersections contributing to the colour constancy process that are likely to yield good illuminant estimates. These stable intersections are integrated to produce likelihoods that the image was captured under each of the reference lights. The maximum likelihood candidate provides the final illuminant estimate. Experiments on a benchmark set of real images show that this novel approach not only outperforms all other algorithms based on the dichromatic model but also achieves a level of colour constancy comparable to that of the best statistical techniques.
References
1. K. Barnard. Practical Colour Constancy. PhD thesis, Simon Fraser University, School of Computing, 1999.
2. G. Finlayson and G. Schaefer. Convex and non-convex illuminant constraints for dichromatic colour constancy. In IEEE Int. Conference Computer Vision and Pattern Recognition, volume 1, pages 598–604, 2001.
3. G. Finlayson and G. Schaefer. Solving for colour constancy using a constrained dichromatic reflection model. Int. J. Computer Vision, 42(3):127–144, 2001.
4. G.D. Finlayson, S.D. Hordley, and P.M. Hubel. Color by correlation: A simple, unifying framework for color constancy. IEEE Trans. Pattern Analysis and Machine Intelligence, 23(11):1209–1221, November 2001.
5. H.-C. Lee. Method for computing the scene-illuminant from specular highlights. Journal Optical Society of America A, 3(10):1694–1699, 1986.
6. S.A. Shafer. Using color to separate reflection components. Color Research Applications, 10(4):210–218, 1985.
7. S. Tominaga. Multi-channel vision system for estimating surface and illuminant functions. Journal Optical Society of America A, 13:2163–2173, 1996.
8. S. Tominaga and B.A. Wandell. Standard surface-reflectance model and illuminant estimation. Journal Optical Society of America A, 6(4):576–584, April 1989.
9. F. Tong and B.V. Funt. Specularity removal for shape from shading. In Vision Interface 1988, pages 98–103, 1988.
¹ The author wishes to thank Steve Hordley of the University of East Anglia for providing the Colour by Correlation results.
Soccer Field Detection in Video Images Using Color and Spatial Coherence Arnaud Le Troter, Sebastien Mavromatis, and Jean Sequeira LSIS Laboratory LXAO group, University of Marseilles FRANCE
Abstract. We present an original approach based on the joint use of color and spatial coherence to automatically detect the soccer field in video sequences. We assume that the corresponding area is large enough for this purpose; this assumption is verified when the camera is oriented toward the field and does not focus on a given element of the scene such as a player or the ball. We make no assumption about the color of the field. We use this approach to automatically validate the image area in which the relevant scene elements are. This is a part of the SIMULFOOT project whose objective is the 3D reconstruction of the scene (players, referees, ball) and its animation as a support for cognitive studies and strategy analysis.
1 Introduction
The SIMULFOOT project started in Marseilles three years ago within the frame of the IFR Marey, a new organization dedicated to biomedical gesture analysis [1][2]. Our main objective is to provide a technological platform to cognitive scientists so that they can investigate new theories about group behaviors and individual perceptions, and validate them [3][4][5]. A direct application of this project, as suggested by its name, is to provide an efficient tool for analyzing soccer games, as described in other similar projects [6][7][8]. There are many problems to solve in implementing this technological platform, such as detecting the players and the ball, or finding landmarks to provide the 2D to 3D registration [9]. But all these elements have to be detected in the field and not out of it (e.g. in the stands): thus, the characterization of the field is the first problem to solve and this is the topic of this paper. Research has been carried out to automatically extract the foreground elements from the background in video sequences, such as the work described in [10] and [11]. But the problem mentioned in those papers is quite different although it looks similar: we do not want to characterize the background but to find the area in the image that corresponds to the relevant part of the background. In other words, we want to provide a background segmentation into two parts, the relevant one and the non-relevant one. Let us also mention the work developed by Vandenbroucke on a method that provides color space segmentation and classification before using snakes
to track the players in video sequences [12]. The ideas developed in that work are very interesting but they do not fully take advantage of the color and spatial coherence, such as the works on the color space described in [13] and [14]. For these reasons, we have developed a new approach that integrates all these features. This approach consists in:
- analyzing the pixel distribution in the color space;
- characterizing the relevant area in the color space (color coherence);
- selecting the corresponding points in the image;
- using a spatial coherence criterion on the selected pixels in the image.
Fig. 1. The selected area is bounded by the red line (points in yellow are landmarks)
We will mainly focus on the color space segmentation, which is a very difficult task in many cases. The approach we propose in this paper is robust (in Fig. 1 we can see that there are many variations of green and it works anyway) and it can be applied to a wide set of situations other than soccer field detection. The only assumption we make is that the colors of the soccer field occupy the majority of the image (i.e. the camera is not oriented toward the stands and does not take a close-up of a player), as illustrated in Fig. 1.
2 Basic Segmentation of the HLS Space
All the pixels of a given image can be represented in a color space, which is a 3D space. Usually, the information captured by a video camera is made of quantized (red, green, blue) values. But this color information can be represented in a more meaningful space such as the HLS (Hue, Lightness, Saturation) space, for example (it provides an explicit expression of the hue, which seems decisive in this context). The first idea we developed was to select points on the basis of a threshold in the natural discrete HLS space. Natural means that we use the set of values associated with the quantized (R,G,B) values. The algorithm is very
simple: we count the occurrences of each value in the HLS space and we keep those with the highest scores. It is very simple but it is not effective at all because of the various possible local value distributions. Fig. 2 illustrates this effect: we have selected pixels whose value occurs at least 16 times and we can see that the result is not satisfactory.
Fig. 2. Threshold on points occurrences in the HLS space
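A minimal version of this first, occurrence-count selection might look as follows; the colorsys-based HLS conversion and the 8-bit quantization are implementation assumptions, while the threshold of 16 follows the text.

```python
import colorsys
import numpy as np

def hls_occurrence_mask(rgb_image, min_count=16):
    """Keep the pixels whose quantized HLS value occurs at least `min_count`
    times in the image (the naive scheme whose shortcomings Fig. 2 shows)."""
    h, w, _ = rgb_image.shape
    hls = np.empty((h, w, 3))
    for y in range(h):
        for x in range(w):
            r, g, b = (rgb_image[y, x] / 255.0).tolist()
            hls[y, x] = colorsys.rgb_to_hls(r, g, b)
    q = np.clip(hls * 255.0, 0, 255).astype(np.int64)        # quantize HLS
    keys = q[..., 0] * 65536 + q[..., 1] * 256 + q[..., 2]    # one id per HLS value
    values, counts = np.unique(keys, return_counts=True)
    return np.isin(keys, values[counts >= min_count])         # boolean pixel mask
```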
These remarks led us to use a different approach to analyze the point distribution in the HLS space. We consider this space at a coarser granularity and we evaluate the contribution of the points (representing the pixels) to each cell (volume element of the discrete HLS space). Then we keep the most significant cells to provide the selection.
3 A Discrete Representation of the HLS Color Space
We represent the HLS space as a set of volume elements, each of them being defined by a set of constraints on the H, L and S values. We have chosen a repartition that provides a discrete model made of 554 cells, in which each interval in L is subdivided into a given number of intervals in H and S.
This discrete representation of the HLS space is illustrated by a set of juxtaposed polyhedrons (Fig. 3) or by a set of small spheres centered in each polyhedron (Fig. 4).
Fig. 3. Polyhedrons representing some of the cells
Fig. 4. Complete discrete representation of HLS space
It implies the following neighborhood relations between the cells (Fig. 5 to 9).
Fig. 5. 8-connectivity (vertices)
Fig. 6. 12-connectivity (edges)
Fig. 7. 24-connectivity (half-edges)
Fig. 8. 6-connectivity (faces)
Fig. 9. 12-connectivity (faces)
It improves the initial selection but it still does not provide a fully satisfactory result. Figs. 10 and 12 show the decimation process based on a threshold on the cell density and illustrate an undesired effect: the rightmost green cell (the most saturated) disappears before the cells along the luminance axis. The effect on the selection is visible in Figs. 11 and 13. We can notice that using only a threshold on the density is not the best solution. The reason is that in this approach we do not consider the interactions between cells and, consequently, we do not fully take advantage of the coherence in the color space. In order to avoid this, we have decided to induce interactions between cells by considering that each point is a source of potential and by evaluating the generated potential at the center of each cell.
Fig. 10–13. (figures: cell decimation by thresholding the cell density, Figs. 10 and 12, and the corresponding pixel selections, Figs. 11 and 13)

4 Potential Sources in the Color Space
The potential emitted by each point has a maximum value at this point (the source), a zero value beyond a given distance (the range), and is linear in between. The range is not the same for all the cells because they do not have the same size. Let us consider R = D / 2, where D is the diameter of the cell to which the point belongs (the diameter is the largest distance between two points of the cell). The range of the potential function is then D_Max = (1 + ε)R, where ε represents the relative increase of R and does not have to be a small value. Each point contributes to increase the density of a cell if its distance to the center of the cell is lower than D_Max; the smaller this distance is, the stronger the contribution.
Fig. 14–15. (figures: emission areas of the potential sources)
Figs. 14 and 15 show the emission areas (in yellow, with variable intensity) in which a given point brings a contribution. For example, P1 has a stronger potential value than P2 over the grayed area, while P3 has a null potential for this area. The figures below illustrate how the use of potential sources improves the pixel selection in the color space: the cells along the luminance axis are eliminated by thresholding before some of the green ones, which have a higher value thanks to the contributions of their neighbors (color coherence).
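The accumulation of potentials over the 554 cells can be sketched as below; the point-to-cell assignment, the cell geometry and the default value of epsilon are illustrative assumptions.

```python
import numpy as np

def accumulate_potentials(points, point_cells, cell_centers, cell_diameters, eps=0.5):
    """Every colour-space point is a potential source.  Its range is
    D_Max = (1 + eps) * R with R = D / 2, where D is the diameter of the cell
    the point belongs to; its contribution to a cell decreases linearly and
    vanishes at distance D_Max from the point."""
    potentials = np.zeros(len(cell_centers))
    centers = np.asarray(cell_centers, dtype=float)
    for p, own_cell in zip(points, point_cells):
        d_max = (1.0 + eps) * cell_diameters[own_cell] / 2.0
        dists = np.linalg.norm(centers - np.asarray(p, dtype=float), axis=1)
        close = dists < d_max
        potentials[close] += 1.0 - dists[close] / d_max
    return potentials
```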
Fig. 16–19. (figures: pixel selection improved by the use of potential sources)

5 Automatic Thresholding and Cell Selection
Up to now, we have only considered a manual selection of the threshold value. Let us describe the classical but efficient algorithm we use to automatically select it. Consider the distribution of the potential values in the 554 cells of the discrete HLS space. Most values are low and a few of them are high. We sort these values in decreasing order and we evaluate the function F that gives their cumulated value (i.e. F(1) is the highest potential, F(2) is the sum of the two highest potentials, and so on). This function varies strongly over the first few values and then varies slowly up to the last value of the variable, 554. The breakpoint of the representative curve indicates the threshold value. This breakpoint is obtained by segmenting the curve into two parts using a split-and-merge algorithm.
Once we have automatically determined the threshold value, we start from the area that has the highest density; then, we look for other areas that are connected (neighborhood relations have been defined previously (Fig. 5 to 9)) until we reach the number of areas determined through the analysis of the cumulated histogram (Table 2). The result is a set of areas in the discrete HLS space as represented below (Fig. 20).
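The breakpoint search can be approximated as follows; taking the point of maximal distance to the chord of the cumulated curve is a simple stand-in for the split-and-merge segmentation used in the paper.

```python
import numpy as np

def cumulated_curve_breakpoint(potentials):
    """Sort the cell potentials in decreasing order, build the cumulated curve F
    and return the breakpoint index (number of cells to keep) together with the
    corresponding threshold value."""
    values = np.sort(np.asarray(potentials, dtype=float))[::-1]
    f = np.cumsum(values)
    x = np.arange(1, len(f) + 1, dtype=float)
    p0 = np.array([x[0], f[0]])
    p1 = np.array([x[-1], f[-1]])
    chord = (p1 - p0) / np.linalg.norm(p1 - p0)
    rel = np.stack([x, f], axis=1) - p0
    # perpendicular distance of every curve point to the chord (2-D cross product)
    dist = np.abs(rel[:, 0] * chord[1] - rel[:, 1] * chord[0])
    k = int(np.argmax(dist))
    return k + 1, values[k]
```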
Fig. 20. Set of automatically selected areas in the discrete HLS space
6 Using Area Coherence
We now have to use the area coherence to definitively determine the Region of Interest (the field). There are two kinds of elements on the field: the players (and the ball) and the white lines. The first ones produce holes and the second ones split the Region of Interest into different connected components that are very close to each other. First, we apply to this image a closing with a structuring element whose size exceeds the usual line width: it connects all these components. Then, we apply to the result an opening with a slightly bigger structuring element in order to eliminate all the non-significant elements that are close to the border. Finally, we keep the connected component that contains the most pixels and we fill its holes to obtain the Region of Interest (Fig. 21).
Fig. 21. The red line outlines the border of the region of interest
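Using SciPy's morphology routines, this spatial-coherence step might be implemented as below; the structuring-element sizes are placeholders chosen relative to the expected line width, not values from the paper.

```python
import numpy as np
from scipy import ndimage

def field_region_of_interest(mask, close_size=9, open_size=11):
    """Closing bridges the components separated by the white lines, opening
    removes small non-significant elements near the border, then the largest
    connected component is kept and its holes are filled."""
    closed = ndimage.binary_closing(mask, structure=np.ones((close_size, close_size)))
    opened = ndimage.binary_opening(closed, structure=np.ones((open_size, open_size)))
    labels, n = ndimage.label(opened)
    if n == 0:
        return opened
    sizes = ndimage.sum(opened, labels, index=np.arange(1, n + 1))
    largest = labels == (int(np.argmax(sizes)) + 1)
    return ndimage.binary_fill_holes(largest)
```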
7 Conclusion
The approach described in this paper has been developed in the framework of a project in which a specific background can be identified mainly through its color coherence. This approach is quite robust for this application. It could be interesting to extend it to a more general framework (for industrial vision), where various backgrounds are present in the image or where the main component in the color space extends over a wide area.
References
1. S. Mavromatis, J. Baratgin, J. Sequeira, “Reconstruction and simulation of soccer sequences,” MIRAGE 2003, Nice, France.
2. S. Mavromatis, J. Baratgin, J. Sequeira, “Analyzing team sport strategies by means of graphical simulation,” ICISP 2003, June 2003, Agadir, Morocco.
3. H. Ripoll, “Cognition and decision making in sport,” International Perspectives on Sport and Exercise Psychology, S. Serpa, J. Alves, and V. Pataco, Eds. Morgantown, WV: Fitness Information Technology, Inc., 1994, pp. 70-77.
4. L. Saitta and J. D. Zucker, “A model of abstraction in visual perception,” Applied Artificial Intelligence, vol. 15, pp. 761-776, 2001.
5. K. A. Ericsson and A. C. Lehmann, “Expert and exceptional performance: Evidence of maximal adaptation to task constraints,” Annual Review of Psychology, vol. 47, pp. 273-305, 1996.
6. B. Bebie, “Soccerman: reconstructing soccer games from video sequences,” presented at IEEE International Conference on Image Processing, Chicago, 1998.
7. M. Ohno, Shirai, “Tracking players and estimation of the 3D position of a ball in soccer games,” presented at IAPR International Conference on Pattern Recognition, Barcelona, 2000.
8. C. Seo, Kim, Hong, “Where are the ball and players?: soccer game analysis with color-based tracking and image mosaic,” presented at IAPR International Conference on Image Analysis and Processing, Florence, 1997.
9. S. Carvalho, Gattas, “Image-based Modeling Using a Two-step Camera Calibration Method,” presented at Proceedings of International Symposium on Computer Graphics, Image Processing and Vision, Rio de Janeiro, 1998.
10. A. Amer, A. Mitiche, E. Dubois, “Context independent real-time event recognition: application to key-image extraction,” International Conference on Pattern Recognition, Québec (Canada), August 2002.
11. S. Lefèvre, L. Mercier, V. Tiberghien, N. Vincent, “Multiresolution color image segmentation applied to background extraction in outdoor images,” IST European Conference on Color in Graphics, Image and Vision, pp. 363-367, Poitiers (France), April 2002.
12. N. Vandenbroucke, L. Macaire, J. Postaire, “Color pixels classification in an hybrid color space,” IEEE International Conference on Image Processing, pp. 176-180, Chicago, 1998.
13. Carron, “Segmentation d’images couleur dans la base Teinte Luminance Saturation : approche numérique et symbolique,” Université de Savoie, 1995.
14. N. Vandenbroucke, L. Macaire, J. Postaire, “Segmentation d’images couleur par classification de pixels dans des espaces d’attributs colorimétriques adaptés: Application à l’analyse d’images de football,” Lille: Université des Sciences et Technologies de Lille, 2000, pp. 238.
New Methods to Produce High Quality Color Anaglyphs for 3-D Visualization Ianir Ideses and Leonid Yaroslavsky Tel-Aviv University, Department of Interdisciplinary Studies, 69978 Ramat Aviv, Israel
[email protected],
[email protected]
Abstract. 3D visualization techniques have received growing interest in recent years, and several methods for the synthesis and projection of stereoscopic images and video have been developed. These include autostereoscopic displays, LCD shutter glasses, polarization-based separation and anaglyphs. Among these methods, anaglyph-based synthesis of 3D images provides a low-cost solution for stereoscopic projection and allows viewing of stereo video content wherever standard video equipment exists. Standard anaglyph-based projection of stereoscopic images, however, usually yields low quality images characterized by ghosting effects and loss of color perception. In this paper, methods for improving the quality of anaglyph images, as well as conserving the image color perception, are proposed. These methods include image alignment and operations on synthesized depth maps. Keywords: Stereoscopic, Anaglyphs, Depth map
1 Introduction

Recent years have shown a dramatic increase in the development of stereoscopic projection devices. These include autostereoscopic displays, polarization-based projectors, temporally multiplexing LCD glasses and color-based filtering (anaglyphs). Autostereoscopic displays are considered the high-end solution for 3D visualization [1]. These devices exhibit excellent stereo perception in color without the need for viewing aids. Overall cost and viewing-area limitations, however, have inhibited the market share of such devices, rendering their market acceptance marginal. Anaglyph-based 3D projection, on the other hand, is considered the lowest cost stereoscopic projection technique. This method can be implemented on standard monitors without any modification. Methods to overcome the drawbacks of this technique, such as loss of color perception and discomfort in prolonged viewing, are addressed in this paper.
2 Generating Color Anaglyphs and Two Methods for Quality Improvement

Viewing anaglyph stereo images requires special glasses. These glasses are made from two color filters, generally red for the left eye and blue for the right. As red and blue are both primary colors, the stereoscopic image pair can easily be divided into its RGB components. By combining red and blue components from the two images, a new stereoscopic image can be obtained and projected. Usually the color qualities of the obtained image are degraded, and for many applications color information is discarded; indeed, anaglyphs created using grayscale stereo pairs are more commonly available than color anaglyphs. It is also possible to generate color anaglyphs by using all color components. This can be achieved by exchanging one or two of the color components between the images of the stereo pair. Most implementations exchange the red component of the right image of the stereo pair with the red component from the left image. In this manner, the color properties of the image are preserved for 2D viewing. Examples of these anaglyph representations are shown in Fig. 1.
Fig. 1. Left image is an anaglyph created from a grayscale stereo pair, right image is an anaglyph created by using all color components. Use red filter for left eye, blue filter for right eye.
One can see that conventional anaglyphs illustrated in Fig. 1 do not provide high quality images. One can substantially improve the quality of color anaglyphs by means of image alignment [2-3]. In this method the red component from the left image is aligned with the green and blue components from the right image. This provides increased quality by reduction of ghosting artifacts. Examples of anaglyph enhancement by alignment can be seen in Fig. 2. This alignment of color components in the resulting anaglyph can be achieved using correlational techniques for finding the disparity between the images in the stereo pair. This method, however, does not guarantee that the main object in the image will be aligned. It may happen that alignment will take place for background objects. Using target location techniques such as matched filtering and adaptive correlators and interactive selection of the foreground object in images it is possible to achieve significantly better results and produce anaglyphs with little non-overlapping areas [2].
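The two ideas, channel recombination and alignment of the red channel, can be sketched as follows; the global-shift alignment by correlation is a simplified stand-in for the target-location techniques mentioned above, and the sign convention for the shift depends on the capture geometry.

```python
import numpy as np

def estimate_global_shift(left, right, max_shift=40):
    """Estimate a single dominant horizontal disparity by maximising the
    correlation between the red channels over candidate shifts."""
    l = left[:, :, 0].astype(float)
    r = right[:, :, 0].astype(float)
    scores = [np.mean(l * np.roll(r, s, axis=1))
              for s in range(-max_shift, max_shift + 1)]
    return int(np.argmax(scores)) - max_shift

def make_anaglyph(left, right, shift=0):
    """Colour anaglyph: red channel from the (optionally shifted) left image,
    green and blue channels from the right image (red filter on the left eye)."""
    out = right.copy()
    out[:, :, 0] = np.roll(left[:, :, 0], -shift, axis=1)
    return out
```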
Fig. 2. Top image is standard anaglyph. Bottom image is an anaglyph enhanced by alignment. Note the diminished ghosting effects.
Additional improvement in visual perception of color anaglyphs with unaided eyes can be achieved by component blurring. It has been shown that substantial defocusing and discarding of color information of one image of the stereo pair does not affect visual 3D perception [4]. We suggest that component blurring be used to decrease visibility of ghosting effects in anaglyphs observed with unaided eyes and to improve viewing comfort while retaining good stereo perception when viewing anaglyphs with color glasses. This method was experimentally tested and verified for images of different dominant colors [3]. Yet further improvement and increased flexibility in generating anaglyphs can be achieved by manipulating depth maps and synthesizing artificial stereo images.
3 Stereo Image Pairs and Depth Maps

In order to view a stereoscopic image, it is necessary to project two images, each image to its respective eye. These images are referred to as a stereo pair. It is known that if the optical properties of the imaging devices are available, one can triangulate the objects within the stereo pair. The principle of triangulation is illustrated in Fig. 3.
The distance h to each pixel can be calculated by triangulation from two angles, determined by the properties of the imaging devices, and from the horizontal parallax d between the eyes or cameras.
Fig. 3. Triangulation based on properties of the imaging devices.
For 3D perception, the important information is the horizontal disparity between the stereo pair images. This value represents the distance between the objects in the stereo pair and allows estimating the non-overlapping areas. Stereoscopic projection for visual observation, however, does not usually require high accuracy in depth perception. We suggest a three-step procedure to generate a stereo pair that allows greater control over disparity values and therefore over the quality of anaglyphs:
a. generating a depth map from an available stereo pair;
b. modifying the depth map;
c. synthesizing an artificial stereo pair and displaying it as an anaglyph.
In order to calculate the depth map of a stereo pair, it is necessary to calculate the position of every object pixel of the right image in the left image. Subtraction of these spatial coordinates provides the horizontal parallax between the images. The required positioning of the pixels between the images of the stereo pair can be achieved using known localization techniques, among them matched filtering, adaptive correlation and local adaptive correlation [6]. In our implementation, localization was performed by target matched filtering in a running window. For each pixel in the right image, a window of 16x16 pixels was taken and targeted in a larger window in the left image. This targeting window was selected to be within standard parallax values. In this process, a depth map is generated for every pixel of the right image. An example of a stereo pair and the resulting depth map displayed as an image is illustrated in Fig. 4. One can now improve the quality of the resulting anaglyphs by manipulating the depth map. As anaglyph quality increases with a decrease of non-overlapping areas, it is in our interest to reduce the dynamic range of the disparity values. This can be implemented by compressing the signal dynamic range with the P-th law transformation [7]. In this transformation, every pixel of the depth map is subject to the following modification:

d_new = d_max (d / d_max)^P

where d_new is the new depth map pixel value, d is the original value, d_max is the maximal disparity, and 0 < P ≤ 1 controls the amount of compression.
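A simple block-matching depth map and the P-th law compression can be sketched as follows; the SSD matching is a plain substitute for the matched-filter localization described above, and the window and search sizes are illustrative.

```python
import numpy as np

def block_matching_disparity(left, right, block=16, max_disp=40):
    """Horizontal disparity for every pixel of the right (grayscale) image,
    found by matching a block around the pixel against a search strip in the
    left image (sum of squared differences)."""
    h, w = right.shape
    half = block // 2
    disp = np.zeros((h, w), dtype=np.float32)
    left = left.astype(np.float32)
    right = right.astype(np.float32)
    for y in range(half, h - half):
        for x in range(half, w - half):
            patch = right[y - half:y + half, x - half:x + half]
            costs = [np.sum((patch - left[y - half:y + half,
                                          x + d - half:x + d + half]) ** 2)
                     for d in range(min(max_disp, w - half - x) + 1)]
            disp[y, x] = float(np.argmin(costs))
    return disp

def compress_disparity(disp, p=0.5):
    """P-th law dynamic range compression of the disparity map."""
    d_max = disp.max() if disp.max() > 0 else 1.0
    return d_max * (disp / d_max) ** p
```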
4 Color Preservation for 3D Viewing

It has been shown that a stereo pair composed from one color image and a blurred grayscale image enables color stereo perception as well [4-5]. 3D viewing of enhanced anaglyphs in color is therefore possible. One can achieve this by using a single color filter instead of a combination of two filters. By using a single color filter (for example a red filter for the left eye) it is possible to project a stereo pair composed of a grayscale image for the left eye and a color image for the right eye. The left-eye image is the original left stereo pair image and the right image is the anaglyph itself. Although the right image of the stereo pair is the anaglyph rather than the original right image components, the visual differences are marginal, resulting in a correctly fused 3D image. This phenomenon has been experimentally tested and verified (aligned images shown in this paper can also be viewed in this manner) [2-3].
Fig. 4. Resulting depth map is below the stereo pair
Fig. 5. Top images are the original depth map and the resulting anaglyph. Bottom images are the depth map after dynamic range compression (P = 0.5) and the resulting anaglyph.
5 Conclusion

Methods for synthesizing high quality color anaglyphs using standard video equipment are suggested and implemented: image alignment, component blurring and synthesis of artificial stereo with compression of the depth map dynamic range. These methods were experimentally verified and implemented for the synthesis of 3D content.
References
1. Paul Kleinberger, Ilan Kleinberger, Hillel Goldberg, J.Y. (Yosh) Mantinband, John L. Johnson, James Kirsch, Brian K. Jones: A full-time, full-resolution dual stereoscopic/autostereoscopic display. Proceedings of SPIE-IS&T Electronic Imaging, Vol. 5006, California, SPIE (2003)
2. http://www.eng.tau.ac.il/~yaro/Ideses/malca1.html
3. I. Ideses, L. Yaroslavsky: Efficient compression and synthesis of stereoscopic video. Proceedings of Visualization, Imaging and Image Processing Conference, Spain (2002) 191–194
4. L.P. Yaroslavsky: On Redundancy of Stereoscopic Pictures. Image Science’85, Vol. 149, Acta Polytechnica Scandinavica, Helsinki (1985) 82–85
5. L.P. Yaroslavsky: Digital Signal Processing in Optics and Holography. Radio i Svyaz’ Publ., Moscow (1987)
6. Yaroslavsky L., Eden M.: Fundamentals of Digital Optics. Birkhauser, Boston (1996)
7. Yaroslavsky L.: Digital Holography and Digital Image Processing. Kluwer Academic Publishers, Boston (2003)
A New Color Filter Array Interpolation Approach for Single-Sensor Imaging Rastislav Lukac1, Konstantinos N. Plataniotis1, and Bogdan Smolka2* 1
The Edward S. Rogers Sr. Dept. of Electrical and Computer Engineering, University of Toronto, 10 King’s College Road, Toronto, M5S 3G4, Canada {lukacr, kostas}@dsp.utoronto.ca 2
Polish-Japanese Institute of Information Technology, Koszykowa 86 Str, 02-008 Warsaw, Poland
[email protected]
Abstract. This paper presents a new approach suitable for color filter array (CFA) interpolation schemes. The proposed method uses both scaling and shifting operations to normalize the color components appearing at the CFA interpolator’s input. The utilization of the proposed model can significantly boost the performance of most well-known CFA interpolators, which then exhibit superior performance and eliminate moiré noise and color shifts in the full color output.
1 Introduction
In single-sensor imaging devices, a single charge-coupled device (CCD) or a complementary metal oxide semiconductor (CMOS) sensor with a color filter array (CFA) is used to produce a two-dimensional array or mosaic of color components. Note that the Red-Green-Blue (RGB) Bayer CFA pattern (Fig. 1a) [1] is commonly used due to the simplicity of the subsequent demosaicking procedure. Since only a single spectral component is available at each spatial location, the restored, high-resolution RGB color image output is obtained by interpolating the missing two color components from the spatially adjacent CFA data. This process is known as CFA interpolation or demosaicking and is an integral part of cost-effective single-sensor imaging devices such as digital cameras and imaging-enabled wireless phones [4],[6],[7],[12]. Powerful CFA-based interpolation methods use color models to avoid a variety of spectral problems. Well-known solutions such as the saturation based adaptive interpolation (SAI) scheme [2], the smooth hue transition (SHT) interpolation scheme [3], Kimmel’s algorithm (KA) [5], the demosaicked image post-processing scheme [7] and the CFA image zooming scheme [8] employ the color-ratio model introduced in [3]. The use of the model makes it possible to utilize both the spectral and spatial characteristics of the RGB image and thus the schemes can reduce spectral artifacts (color shifts, moiré, aliasing) in the restored image.*
* This research has been supported by grant No. PJ/B/01/2004 from the Polish-Japanese Institute of Information Technology.
2 Color-Ratio Model Based Demosaicking
Let us consider a gray-scale image representing a two-dimensional matrix of integer samples acquired by the sensor; row and column indices indicate the spatial location of the samples. Since the sensor is essentially a monochromatic device, capable of obtaining only a single measurement of luminance per spatial location, a CFA is used to separate the incoming light into a specific spatial arrangement of color components. Assuming the Bayer CFA pattern with a GRGR phase in the first row [4],[7], the gray-scale pixels are transformed into RGB vectors in which the acquired value occupies the R, G or B position dictated by the Bayer pattern [8]. Since only a single measurement is available at each spatial location, the color vectors of the CFA image are completed using two zero components. Due to the double sampling frequency of the G components, Bayer CFA demosaicking algorithms start the interpolation procedure with the G color plane. In order to represent proportionally the contribution of the adjacent color components, most demosaicking algorithms determine the missing G component as a weighted average, Equation (2), of the G components of the neighbouring color vectors in the spatial arrangement shown in Fig. 1b. Each input component is associated with a non-negative weight reflecting the edge sensitivity of the demosaicking solution; since (2) is an unbiased solution, the weights are normalized so that they sum to one. Non-adaptive demosaicking solutions such as SHT [3] utilize fixed, equal weights, whereas edge-sensing (adaptive) CFA-based solutions [2],[5],[8] calculate the weights using some form of inverse gradients, since inversely proportional values appropriately penalize the associated inputs and regulate the contribution of the available color components inside the spatial arrangement. Using the corresponding weights, the inputs which are not positioned across an edge are emphasized, and thus the interpolation process is directed along the edges. The use of (2) completes the G color plane of the image x. Since the G component is then available at each spatial location, original R (or B) CFA locations correspond to available RG (or BG) pairs which can be used in
the subsequent demosaicking steps. Since natural images exhibit a high spectral correlation among color channels, utilizing the spectral correlation that exists between the R and G (or B and G) color bands makes it possible to use more information from the color image x during demosaicking and to reduce color artifacts in the demosaicked images. The schemes under consideration here, namely SHT, SAI, and KA, use the color-ratio model of [3] to exploit the information from the different spectral bands. The color-ratio model is based on the assumption of hue uniformity enforced through R/G and B/G ratios in localized image areas [3],[5]. Thus, two RGB vectors located at neighboring spatial locations (i,j) and (g,h) should exhibit identical hue characteristics, resulting in the following expression [3],[12]:

R(i,j) / G(i,j) = R(g,h) / G(g,h),   B(i,j) / G(i,j) = B(g,h) / G(g,h).    (3)
The interpolated value of the R or B component is then obtained by rescaling the corresponding colour-ratio by the G value available at the interpolation location and, analogously, the available R or B components can be used to obtain the G component. Since natural color images do not have large areas with uniform hue characteristics, the color-ratios available in the spatial neighbourhood are smoothed by a weighted average before rescaling, as expressed in (4) [2],[5],[8]. In (4), the G component available at the interpolated location normalizes the smoothed ratios back to the intensity range; the available RG (or BG) vectors occupy the arrangement shown in Figs. 1c,d. Since this interpolation step does not produce all the needed R (or B) components, (4) should be repeated with the RG (or BG) vectors located in the arrangement of Figs. 1e,f. The procedure then results in fully populated R and B color planes. If the CFA-based solution is designed to utilize some type of correction process [5],[7],[8], the R or B color planes completed using (4) are used along with the original G CFA components to re-evaluate the G components previously obtained without the use of spectral information. The interpolator then re-determines the G component via the analogous expression (5), over the neighbourhood illustrated in Fig. 1b. If the location corresponds to a B CFA position, the B components are used in (5); otherwise it corresponds to an R CFA position and the R components are used. The central components in (4) and (5) normalize the smoothed ratio quantities to the appropriate intensity range. Since these components, located at the centre of the spatial neighbourhood, describe the structural content of the image, the normalizing operations impress the high-frequency portion of the image onto the interpolator’s output.

Fig. 1. Bayer CFA pattern (a) and the spatial arrangements (b-f) of the color components obtained during demosaicking.
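To make these two steps concrete, here is a sketch of the G-plane interpolation of (2) and the colour-ratio interpolation of (4) at a single interior location; the neighbourhood offsets, the inverse-gradient weight heuristic and the small denominator guard are illustrative choices, not the exact formulation of the cited schemes.

```python
import numpy as np

def edge_sensing_weights(G, y, x, offsets, eps=1e-3):
    """Inverse-gradient weights: each neighbour is weighted by the inverse of
    the intensity difference to the neighbour on the opposite side, so that
    samples lying across an edge are penalized; weights are normalized."""
    raw = []
    for dy, dx in offsets:
        grad = abs(float(G[y + dy, x + dx]) - float(G[y - dy, x - dx]))
        raw.append(1.0 / (grad + eps))
    total = sum(raw)
    return [w / total for w in raw]

def interpolate_green(G, y, x):
    """Weighted average of the four axial G neighbours (Fig. 1b), i.e. (2)."""
    offsets = [(-1, 0), (1, 0), (0, -1), (0, 1)]
    w = edge_sensing_weights(G, y, x, offsets)
    return sum(wk * G[y + dy, x + dx] for wk, (dy, dx) in zip(w, offsets))

def interpolate_red_color_ratio(R, G, y, x):
    """Colour-ratio interpolation of a missing R value, i.e. (4): smooth the
    R/G ratios of the diagonal neighbours and rescale by the G value already
    available at (y, x)."""
    offsets = [(-1, -1), (-1, 1), (1, -1), (1, 1)]
    w = edge_sensing_weights(G, y, x, offsets)
    ratio = sum(wk * R[y + dy, x + dx] / max(float(G[y + dy, x + dx]), 1e-3)
                for wk, (dy, dx) in zip(w, offsets))
    return G[y, x] * ratio
```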
3 Proposed Method
The color-ratio model (3) employed in (4) and (5) is based on the assumption of hue uniformity within a localized image area. It is not difficult to see that the model fails near edge transitions, where both the spectral and spatial correlation characteristics of the image vary significantly, resulting in color shifts in the restored images [7]. It has been shown in [7] that the calculation of color-ratios often results in singularities corresponding to strong color artifacts in the demosaicked image. To overcome the problem, a normalized color-ratio solution based on linear scaling and shifting operations improves the model’s characteristics near the edge transitions while preserving the performance in uniform image areas [9]. This model uses a linear transformation utilized in color image enhancement to preserve the hue characteristics of the image [10]. Let a positive scaling factor and a shift parameter be given. The procedure of [9] applies these scaling and shifting operations to normalize the color components before the ratios are formed; thus, the underlying color-ratio model (3) changes to the corresponding normalized expression (6) [9].

It is not difficult to see that for particular settings of the scaling and shifting parameters the proposed normalized color-ratio model generalizes the conventional color-ratio model of [3], and for other settings (6) generalizes the model used in demosaicked image post-processing [7]. Under the new model, the R or B components at an interpolation location are calculated from the normalized R/G (or B/G) ratios, and the G component is obtained from the normalized G/R (or G/B) ratios. This linear transformation normalizes the dynamic range of the color-ratios. Simple inspection reveals that in uniform image areas, for any arbitrary parameter values, the normalized ratio is qualitatively
identical to the conventional ratio. However, near edge transitions the use of the linear scaling and shifting operations of [10] maps the color-ratios closer to unity, enforcing the underlying modelling assumption for both uniform and detail-rich areas, whereas the conventional color-ratio model fails and introduces shifted colors [9]. To ensure the smooth characteristics of the interpolated image, the CFA interpolator averages the normalized ratios corresponding to spatially neighbouring locations: the R or B component is determined by the weighted average of normalized ratios in (7), while the G component is output using the analogous expression (8) [9]. The area of support is either the neighbourhood illustrated in Figs. 1b,e,f or the one corresponding to the masks depicted in Figs. 1c,d. The normalized central components in (7) and (8) map the operand of the weighted average operator from the ratio domain to the normalized intensity domain. To recover the original intensities, an inverse shifting normalization is applied to the output values and the procedure completes with an inverse scaling normalization. The proposed model (6) enforces the hue uniformity considered as the underlying modelling assumption in a robust way [9]. Although the model normalizes discontinuities in intensity to ensure the correct utilization of the spectral information, the interpolator’s weights are still used to follow the edge information. Thus, edge-sensing demosaicking schemes can produce excellent improvements compared with the case where the identical adaptive interpolators use the conventional color-ratio model.
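The corresponding sketch for the normalized model simply inserts the scaling and shifting normalization before forming the ratios and inverts it afterwards; the concrete mapping c ↦ (c + beta)/alpha and the default parameter values are assumptions for illustration, not the exact form or settings used in the paper, and the weights reuse edge_sensing_weights from the previous sketch.

```python
def interpolate_red_normalized(R, G, y, x, alpha=1.0, beta=256.0):
    """Normalized colour-ratio interpolation in the spirit of (7): components
    are mapped to a normalized domain, the smoothed ratio is formed and
    rescaled by the normalized central G value, then the inverse shifting and
    scaling operations recover the original intensity range."""
    offsets = [(-1, -1), (-1, 1), (1, -1), (1, 1)]
    w = edge_sensing_weights(G, y, x, offsets)   # from the sketch above
    norm = lambda c: (float(c) + beta) / alpha   # assumed normalization form
    ratio = sum(wk * norm(R[y + dy, x + dx]) / norm(G[y + dy, x + dx])
                for wk, (dy, dx) in zip(w, offsets))
    return alpha * (norm(G[y, x]) * ratio) - beta  # inverse normalization
```

A large shift pushes all ratios towards unity, which is consistent with the behaviour near edge transitions described above.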
4 Experimental Results
To examine the performance of the proposed method and facilitate comparisons with CFA interpolation schemes operating on the conventional color-ratio model, the color images shown in Fig. 2 are utilized. Note that the image Lighthouse is 768 × 512 in size, whereas the images Bikes and Sydney are 512 × 512 images. All test images have been captured using a three-sensor device and normalized to an 8-bit per channel RGB representation. It should be emphasized that SHT and SAI use the spectral information only to interpolate the R or B components and thus the new variants of these schemes can utilize (7) only; the new KA variant, on the other hand, uses the spectral characteristics in both (7) and (8) to produce the full color output.
Fig. 2. Test color images: (a) Lighthouse, (b) Bikes, (c) Sydney.
Based on experimentation with a wide range of color images, fixed values of the scaling and shifting parameters were found to boost the performance of the CFA interpolation schemes operating on the proposed normalized color-ratios. Following common practice in the research community, mosaic versions of the images are created by discarding color information according to a GRGR-phased Bayer CFA pattern (Fig. 1a) [4],[6],[7]. Demosaicked images are generated using each of the listed methods. The efficiency of the interpolation methods is measured objectively using the mean absolute error (MAE), the mean square error (MSE) and the normalized color difference (NCD) criterion [11].
Table 1 summarizes the objective results. It can be seen that the standard SHT, SAI, and KA schemes, which operate on the conventional color-ratio model, produce worse results compared to their variants operating on the normalized color-ratio model. The most impressive improvement is obtained for the KA, which utilizes an iterative correction cycle. These results indicate that, depending on the edge-sensing mechanism and the interpolation/correction steps employed, the use of the normalized color-ratio model allows the design of powerful demosaicking tools.
Fig. 3. Enlarged parts of the results corresponding to the images Lighthouse (top row) and Sydney (bottom row). The original images (a) are compared against the output images (b-e) obtained using the KA scheme operating on (7) and (8) with different settings of the normalizing parameters.
Fig. 3 depicts the outputs obtained using the test image Lighthouse and the KA scheme operating on the proposed normalized color-ratio model (6) employed in (7) and (8). Visual inspection of the results shown in Fig. 3b indicates that the conventional setting of the normalizing parameters causes singularities in the ratio calculations, resulting in strong color artifacts. Small shifts in the parameter values reduce this problem, although color shifts are still present in the demosaicked outputs (Figs. 3c,d). When the suggested setting of the normalizing parameters is used, the undesired side-effects are completely eliminated (Fig. 3e). At the same time, the scheme produces the output with the highest fidelity to the original (Fig. 3a). Fig. 4 depicts enlarged parts of the image Bikes cropped in an edge area which is usually problematic for demosaicking schemes. The use of the conventional SAI and KA results in color shifts (Figs. 4b,c). By employing the proposed model in both schemes, the new variants (Figs. 4d,e) clearly outperform their standard versions, producing naturally colored images pleasing for viewing.
5 Conclusion
This paper presented CFA interpolation based on a normalized color-ratio model. The model takes advantage of both linear scaling and shifting operations to normalize the color-ratios and ensure robust modelling assumptions in both flat and edge image regions. By employing the proposed model instead of the conventional color-ratio model, popular CFA-based interpolation solutions can produce naturally colored images, pleasing for viewing.
Fig. 4. Enlarged parts of the results corresponding to the image Bikes: (a) original image, (b) conventional SAI output, (c) conventional KA output, (d) proposed SAI output, (e) proposed KA output.
Future research will focus on normalizing parameters realized as positive-definite functions of the color components inside the localized image area of support. It is expected that this modification will further boost the performance and provide additional flexibility.
References
1. Bayer, B.E.: Color imaging array. U.S. Patent 3 971 065 (1976)
2. Cai, C., Yu, T.H., Mitra, S.K.: Saturation-based adaptive inverse gradient interpolation for Bayer pattern images. IEE Proceedings - Vision, Image, Signal Processing 148 (2001) 202–208
3. Cok, D.R.: Signal processing method and apparatus for producing interpolated chrominance values in a sampled color image signal. U.S. Patent 4 642 678 (1987)
4. Gunturk, B., Altunbasak, Y., Mersereau, R.: Color plane interpolation using alternating projections. IEEE Trans. Image Processing 11 (2002) 997–1013
5. Kimmel, R.: Demosaicing: image reconstruction from color CCD samples. IEEE Trans. Image Processing 8 (1999) 1221–1228
6. Longere, P., Zhang, X., Delahunt, P.B., Brainard, D.H.: Perceptual assessment of demosaicing algorithm performance. Proceedings of the IEEE 90 (2002) 123–132
7. Lukac, R., Martin, K., Plataniotis, K.N.: Demosaicked image postprocessing using local color ratios. IEEE Transactions on Circuits and Systems for Video Technology 14 (2004) 914–920
8. Lukac, R., Martin, K., Plataniotis, K.N.: Digital camera zooming based on unified CFA image processing steps. IEEE Transactions on Consumer Electronics 50 (2004) 15–24
9. Lukac, R., Plataniotis, K.N.: Normalized color-ratio modelling for CFA interpolation. IEEE Transactions on Consumer Electronics 50 (2004)
10. Naik, S.K., Murthy, C.A.: Hue-preserving color image enhancement without gamut problem. IEEE Transactions on Image Processing 12 (2003) 1591–1598
11. Plataniotis, K.N., Venetsanopoulos, A.N.: Color image processing and applications. Springer Verlag, 2000
12. Ramanath, R., Snyder, W.E., Bilbro, G.L., Sander III, W.A.: Demosaicking methods for Bayer color arrays. Journal of Electronic Imaging 11 (2002) 306–315
A Combinatorial Color Edge Detector Soufiane Rital and Hocine Cherifi LIRSIA, University of Burgundy, Faculty of Sciences Mirande - Dijon, France {soufiane.rital, hocine.cherifi}@u-bourgogne.fr
Abstract. In this paper, we present an edge detection approach in color image using neighborhood hypergraph. The edge structure is detected by a structural model. The Color Image Neighborhood Hypergraph (CINH) representation is first computed, then the hyperedges of CINH are classified into noise or edge based on hypergraph properties. To evaluate the algorithm performance, experiments were carried out on synthetic and real color images corrupted by alpha-stable noise. The results show that the proposed edge detector finds the edges properly from color images. Keywords: Graph, Hypergraph, Color space, Neighborhood hypergraph, noise detection.
1 Introduction
Edge detection is a front-end processing step in most computer vision and image understanding systems. The accuracy and reliability of edge detection is critical to the overall performance of these systems. Earlier developments in edge detection are mostly based on direct application of difference operators and can encounter difficulties when images are corrupted by noise. Much research has been carried out in the effort to detect edge structures in the presence of noise. One type of edge detector employs smoothing before using the difference operator, so as to offset the effects of noise. The use of color in edge detection increases the amount of information needed for processing, which complicates the definition of the problem. A number of approaches have been proposed, from processing individual planes to true vector-based approaches. The use of color images adds one important step, image recombination, which can be inserted at different places. This insertion translates into performing some set of operations on each color component. The intermediate results are then combined into a single output. The point at which recombination occurs is key to understanding the different categories of color edge detection algorithms: output fusion methods, multidimensional gradient methods, and vector methods. In output fusion methods, gray-scale edge detection is carried out independently in each color component; combining these results yields the final edge map. Multidimensional gradient methods are characterized by a single estimate of the orientation and strength of an edge at a point. The first such method belongs to Robinson [6], who also appears to have published the first paper on color
edge detection. He computed 24 directional derivatives (8 neighbors × 3 components) and chose the one with the largest magnitude as the gradient. However, it was Di Zenzo [11] who wrote the classic paper on multidimensional gradients. His method was derived algebraically, but it is perhaps better explained in terms of matrices. A 2 × 2 matrix is formed from the outer product of the gradient vector in each component. These matrices are summed over all components, and the square root of the principal eigenvalue (i.e., the principal singular value) becomes the magnitude of the gradient. The corresponding eigenvector yields the gradient direction. This approach was used in various forms by Cumani [4]. In vector methods, the decomposition and recombination steps nullify each other; the vector nature of color is preserved throughout the computation. How to represent and use these vectors has varied greatly. Perhaps the most compelling work in vector methods so far has been that of Trahanias and Venetsanopoulos [10]. Their method used the median of a set of vectors, which is the vector in that set whose distance to all other vectors is minimized. Once the vector median has been determined, vectors in a neighborhood are sorted by their distances from the vector median, and various statistics are measured and used for edge detection. Algorithms that incorporate more vector operations are preferable to those with fewer [7]. In this paper, we propose a structural approach to edge detection which is robust even when the color image is corrupted by noise. We choose the alpha-stable noise distribution because it has an impulsive behavior. This approach belongs to the vector method category. The method is based on a structural model in the color image neighborhood hypergraph representation. In a hypergraph [2], the vertices V correspond to objects and the hyperedges E represent the interrelations between these objects. Various applications have been proposed for gray-scale images using the Image Neighborhood Hypergraph (INH) representation [3][5]. The proposed color edge detector executes in two phases. The color image neighborhood hypergraph is first computed. Then, the edge and noise structures are detected by the structural model. In Section 2, we first briefly review some background on hypergraph theory. Then, we define the neighborhood hypergraph associated with any color image, denoted the Color Image Neighborhood Hypergraph (CINH). The proposed edge detector algorithm is presented in Section 3 and its performance is illustrated in Section 4. Finally, the paper ends with a conclusion in Section 5.
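The multidimensional (Di Zenzo-style) gradient described above is straightforward to prototype. The following sketch is illustrative only (the function name and the use of NumPy are choices made here, not part of the paper); it forms the 2 × 2 structure tensor from the per-channel derivatives and takes the square root of its principal eigenvalue as the edge magnitude.

```python
import numpy as np

def di_zenzo_gradient(image):
    """Multidimensional gradient of a multichannel image (Di Zenzo-style sketch).

    image: float array of shape (H, W, C).
    Returns (magnitude, direction) arrays of shape (H, W).
    """
    # Per-channel derivatives along rows and columns.
    gy, gx = np.gradient(image, axis=(0, 1))
    # Entries of the 2x2 structure tensor, summed over all components.
    gxx = np.sum(gx * gx, axis=2)
    gyy = np.sum(gy * gy, axis=2)
    gxy = np.sum(gx * gy, axis=2)
    # Principal eigenvalue of [[gxx, gxy], [gxy, gyy]].
    trace = gxx + gyy
    diff = gxx - gyy
    root = np.sqrt(diff ** 2 + 4.0 * gxy ** 2)
    lambda_max = 0.5 * (trace + root)
    magnitude = np.sqrt(np.maximum(lambda_max, 0.0))   # principal singular value
    direction = 0.5 * np.arctan2(2.0 * gxy, diff)      # orientation of the eigenvector
    return magnitude, direction
```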
2 Background on Hypergraph Theory
As our main interest in this paper is to use combinatorial models, we will introduce the basic tools that are needed. A hypergraph $H = (X, (E_i)_{i \in I})$ on a set $X$ is a family $(E_i)_{i \in I}$ of non-empty subsets of $X$, called hyperedges, whose union is $X$. Given a graph G, the hypergraph having the vertices of G as vertices and the neighborhood of these vertices as hyperedges (including these vertices) is called
the neighborhood hypergraph of G. Thus, to each graph we can associate a neighborhood hypergraph. A chain is a succession of hyperedges; it is disjoint if its hyperedges are not connected two by two. A hyperedge $E_x$ is isolated if and only if, for every vertex $y \in E_x$ with $y \neq x$, the hyperedge $E_y$ is contained in $E_x$.
2.1 Color Image and Neighborhood Relations
In this paper, a color image will be represented by the mapping $I : X \subseteq \mathbb{Z}^2 \rightarrow C \subseteq \mathbb{Z}^n$ (with $n = 3$ for a color image). Vertices of $X$ are called pixels; elements of $C$ are called colors. A distance $d$ on $X$ defines a grid (a connected, regular graph without loops or multi-edges). Let $d'$ be a distance on $C$; we then have a neighborhood relation on the image defined by:
$\Gamma_{\alpha,\beta}(x) = \{ x' \in X,\; x' \neq x : d'(I(x), I(x')) \leq \alpha \ \text{and}\ d(x, x') \leq \beta \}.$
The neighborhood of $x$ on the grid will be denoted by $\Gamma_{\alpha,\beta}(x)$. So to each color image we can associate a hypergraph, called the Color Image Neighborhood Hypergraph (CINH): $H = (X, \{\{x\} \cup \Gamma_{\alpha,\beta}(x)\}_{x \in X})$. On a grid, to each pixel we can associate a neighborhood according to a predicate. The predicate may be completely arbitrary provided it is useful for a task domain. It may be defined on a set of points, it may use colors, or some symbolic representation of a set of colors, or it may be a combination of several predicates, and so on. The thresholding can be carried out in two manners. In the first case, the threshold is given for all the pixels of the image, whereas in the second case, the threshold is generated locally and then applied in an adaptive way to each pixel.
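The CINH construction can be sketched as follows. The exact predicate and thresholding used by the authors were partially lost in extraction; this hypothetical version uses a global colour threshold alpha and a grid radius beta, which is only one plausible reading.

```python
import numpy as np

def cinh(image, alpha, beta=1):
    """Sketch of a Color Image Neighborhood Hypergraph (CINH).

    For every pixel x, the hyperedge E(x) contains x and the pixels within
    grid distance `beta` whose colour distance to x is at most `alpha`.
    """
    img = image.astype(float)
    h, w, _ = img.shape
    hyperedges = {}
    for y in range(h):
        for x in range(w):
            edge = {(y, x)}
            for dy in range(-beta, beta + 1):
                for dx in range(-beta, beta + 1):
                    ny, nx = y + dy, x + dx
                    if (dy, dx) != (0, 0) and 0 <= ny < h and 0 <= nx < w:
                        if np.linalg.norm(img[y, x] - img[ny, nx]) <= alpha:
                            edge.add((ny, nx))
            hyperedges[(y, x)] = edge
    return hyperedges
```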
2.2 Color Distance
Before computing the color image neighborhood hypergraph representation, we must first compute the distance between each pair of colors in the neighborhood. For this distance measure to agree with human perception, we must find a good combination of a color space and a distance function on this space. There is general agreement that the organization of color in our perceptual systems is three-dimensional, but the actual assignment of coordinates to colors depends on the task involved. As a result, many color spaces exist (see [8] for details on many of them). Few of these spaces were designed to mimic perceived color distances, however. In particular, the Red, Green, and Blue (RGB) color space has hardly ever been advocated as a good space for measuring color distances. One of the few color spaces that was designed from a perceptual standpoint is the CIE-Lab color space [8]. CIE-Lab was constructed from the results of psychophysical color similarity experiments. The Euclidean distance between two nearby colors in this space is intended to be equivalent to their perceptual distance. In [9], G. Sharma presented a new color-difference formula, CIEDE2000. The Euclidean distance (ED) is the metric usually used in an N-dimensional vector space. It is defined as $ED(c_1, c_2) = \|c_1 - c_2\|$, where $\|\cdot\|$ is the vector norm.
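A minimal way to obtain such a perceptual distance, assuming scikit-image's rgb2lab conversion is an acceptable implementation choice (the paper does not prescribe one), is:

```python
import numpy as np
from skimage import color  # rgb2lab assumed available from scikit-image

def lab_distance(rgb_a, rgb_b):
    """Euclidean distance between two RGB colors measured in CIE-Lab.

    rgb_a, rgb_b: length-3 arrays with values in [0, 1].
    """
    lab_a = color.rgb2lab(np.asarray(rgb_a, dtype=float).reshape(1, 1, 3))[0, 0]
    lab_b = color.rgb2lab(np.asarray(rgb_b, dtype=float).reshape(1, 1, 3))[0, 0]
    return float(np.linalg.norm(lab_a - lab_b))

# Example: distance between a mid red and a mid orange.
print(lab_distance([0.8, 0.1, 0.1], [0.9, 0.5, 0.1]))
```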
3 Edge Detector Algorithm
In this section, we describe a structural edge detection algorithm based on the color image neighborhood hypergraph representation and on structural models of edge and noise. Figure 1 illustrates the block diagram of the proposed algorithm. It starts with CINH generation, followed by noise and edge classification. This classification is based on two structural models. To model a noisy hyperedge and an edge hyperedge in a color image, we propose the two following definitions: Definition 1. We say that $E(x)$ is a noise hyperedge if it verifies one of the two following conditions: (i) the cardinality of $E(x)$ is equal to 1, i.e., $E(x)$ is not contained in a disjoint thin chain having at least the required number of elements; (ii) $E(x)$ is an isolated hyperedge and there exists an element $y$ belonging to the open neighborhood of $x$ on the grid such that $E(y)$ is isolated (i.e., $E(x)$ is isolated and it has an isolated hyperedge in its neighborhood on the grid). Definition 2. For every $x \in X$: if $E(x)$ is not isolated, is contained in a disjoint chain having at least the required number of elements, and is not a noise hyperedge in the sense of Definition 1, then $E(x)$ is classified as an edge hyperedge.
Fig. 1. The block diagram of the proposed color edge detector algorithm.
4 Results and Discussion
We shall present a set of experiments in order to assess the performance of the color edge detection algorithm we have discussed so far. Our goal in the first experiment is to evaluate the noisy hyperedge detection. In the second experiment, we evaluate the color edge structural model on uncorrupted color images. Finally, we evaluate the color edge detection on corrupted color images. The distance measure between two vectors in a given color space is the Euclidean distance. We used 256 × 256-pixel “Peppers”, “Logo” and “Fruit” images, all of them being true-color images (24 bits/pixel). We tested the performance of the noisy hyperedge detection described above in the presence of alpha-stable noise. This distribution is a useful model of impulsive noise. For a symmetric distribution, the characteristic function is given by $\varphi(t) = \exp\{ j\delta t - \gamma |t|^{\alpha} \}$, where
(1) $\alpha$ is the characteristic exponent, satisfying $0 < \alpha \leq 2$. The characteristic exponent controls the heaviness of the tails of the density function. The tails are heavier, and thus the noise more impulsive, for low values of $\alpha$, while for a larger $\alpha$ the distribution has a less impulsive behavior. (2) $\delta$ is the location parameter. (3) $\gamma$ is the dispersion parameter, which determines the spread of the density around its location parameter. In the evaluation and comparison of the noisy hyperedge detection, two criteria are employed, namely the probability of detection $P_d$ and the probability of false alarm $P_{fa}$. Let us consider an image corrupted with a noise source whose true noise locations are stored for comparison. For each detector under consideration, $P_d$ and $P_{fa}$ are computed by comparing the detected locations with the true set of noise. A good detector will have a high detection probability and a low false alarm probability.
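A hedged sketch of this evaluation protocol: symmetric alpha-stable noise is injected at known locations (here via SciPy's levy_stable generator, an assumed implementation choice), and $P_d$ / $P_{fa}$ are computed against the stored ground-truth noise mask. All names are illustrative.

```python
import numpy as np
from scipy.stats import levy_stable  # symmetric alpha-stable sampler (assumed choice)

def add_alpha_stable_noise(image, alpha, gamma, fraction, seed=None):
    """Corrupt a given fraction of pixels with symmetric alpha-stable noise."""
    rng = np.random.default_rng(seed)
    noisy = image.astype(float).copy()
    mask = rng.random(image.shape[:2]) < fraction            # true noise locations
    n = int(mask.sum())
    noise = levy_stable.rvs(alpha, 0.0, loc=0.0, scale=gamma,
                            size=(n,) + image.shape[2:])
    noisy[mask] += noise
    return noisy, mask

def detection_rates(detected_mask, true_mask):
    """Probability of detection (P_d) and of false alarm (P_fa)."""
    p_d = (detected_mask & true_mask).sum() / max(true_mask.sum(), 1)
    p_fa = (detected_mask & ~true_mask).sum() / max((~true_mask).sum(), 1)
    return p_d, p_fa
```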
4.1 Noise Removal
In this section, we first evaluate the use of the Euclidean distance in two color spaces, RGB and CIELab, using the ROC curves of the noisy hyperedge detection for many thresholds in [0, 255]. The Peppers color image used here is corrupted by 6% of alpha-stable noise. A better detector should have detection rates as large as possible and false alarm rates as small as possible, i.e., a ROC curve that bows away from the diagonal line as much as possible. From Figure 2, we can see that detection in the CIELab space gives significantly better results than in the RGB space. The results demonstrate the performance improvement obtained by using the Euclidean distance to compute the CINH representation and, consequently, the superiority of the CIELab space.
Fig. 2. ROC curve of the proposed noisy hyperedge detection in Peppers color image corrupted by 6% of alpha-stable noise using CINH representation in RGB and CIELab color spaces.
After the color space comparison, we evaluate the performance of the proposed noise model by estimating the noisy hyperedges. The objective of the filtering is to remove the noisy hyperedges while preserving the noise-free patterns. In Figure 3, we present the results of the noise detection in the Fruit color image corrupted by 6% of alpha-stable noise with two parameter settings: $\alpha = 2$, corresponding to Gaussian noise, and a small $\alpha$, corresponding to impulsive noise. These two results are compared with the Vector Median Filter (VMF) [1]. The VMF operates using 3 × 3 square processing windows. From the error image between the filtered image and the original image, we note that the proposed algorithm preserves the edges of the corrupted color image better than the VMF filter.
Fig. 3. Filtered results of the Fruit color image. (a,a’) The images degraded by 6% of alpha-stable noise (Gaussian and impulsive settings, respectively). (b,b’) The outputs of the proposed noise model in the CIELab color space for the two settings, respectively. (c,c’) The outputs of the VMF filter. (d,d’),(e,e’) The absolute error between the original color image and the filtered image for the VMF filter and the proposed algorithm, respectively (the RGB values were multiplied by a factor of 10).
4.2 Edge Detection
In this section, we present the results of edge detection on uncorrupted color images in order to evaluate the structural edge model, initially on synthesized color images and then on real color images. In Figure 4(a,b,c), we first illustrate the edge detection results for the synthetic Logo color image. This figure contains the edge maps of the proposed and the Cumani algorithms. These two algorithms could both detect the significant edges present in the Logo color image. Nevertheless, the Cumani algorithm did not detect the junctions of the majority of the objects in the image. In Figure 4(a’,b’,c’), we illustrate the edge detection results for the Peppers color image. According to these results, we note that the Cumani
algorithm detected more false edges than the proposed algorithm. Let us note that the results of both the proposed algorithm and the Cumani algorithm are obtained by adjusting their parameters in order to obtain fewer false edges and more significant edges. The detection of fewer false edges and the correct detection of the junctions prove the superiority of the proposed algorithm.
Fig. 4. The edge detection results of the Cumani and proposed algorithms. (a,a’) the original color images. (b,b’) the outputs of the Cumani algorithm with the thresholds TH = 2 and TH = 10, respectively. (c,c’) the outputs of the proposed algorithm with their respective parameters.
4.3 Edge Detection in Noisy Color Image
Two alpha-stable noise sources, one Gaussian ($\alpha = 2$) and one impulsive (small $\alpha$), each at 6%, are added to the Logo and Peppers color images in order to evaluate the robustness of the proposed method to noise effects. The Gaussian noise is added to the Logo color image, while the impulsive noise is added to the Peppers color image. The edge detection results for these two color images are illustrated in Figure 5. According to these results, the two remarks made about the Cumani algorithm in Figure 4 are confirmed again: more false edges and the wrong detection of the junctions by the Cumani algorithm. Concerning the robustness of the two algorithms to noise effects, we note that the Cumani algorithm is robust to Gaussian noise, while the proposed algorithm is robust to both impulsive and Gaussian noise.
Fig. 5. The edge detection results of the Cumani and proposed algorithms. (a,a’) the corrupted color images (Gaussian and impulsive noise, respectively). (b,b’) the outputs of the Cumani algorithm with the thresholds TH = 6 and TH = 10, respectively. (c,c’) the outputs of the proposed algorithm with their respective parameters.
5 Conclusions
A new algorithm for edge detection based on a structural model is proposed for color images, whether or not they are corrupted by noise. The algorithm includes two phases: CINH generation, followed by edge or noise classification. Simulation experiments have shown that it is consistent and reliable even when image quality is significantly degraded by impulsive or Gaussian noise. It is effective both at cancelling noise while preserving details and at detecting features.
References
1. J. Astola, P. Haavisto, and Y. Neuvo. Vector median filters. Proc. of the IEEE, 78(4):678–689, April 1990.
2. C. Berge. Hypergraphs. North-Holland, 1987.
3. A. Bretto, H. Cherifi, and D. Aboutajdine. Hypergraph imaging: an overview. Journal of the Pattern Recognition Society, 35(1):651–658, 2002.
4. A. Cumani. Edge detection in multispectral images. Computer Vision, Graphics, and Image Processing, 8(6):40–51, 1991.
5. S. Rital, A. Bretto, H. Cherifi, and D. Aboutajdine. A combinatorial based technique for impulsive noise removal in images. Image Processing and Communication Journal, 1(1):3–4, January 2001.
6. G. Robinson. Color edge detection. Optical Eng., 16(5):479–484, Nov. 1977.
7. M. A. Ruzon and C. Tomasi. Edge, junction, and corner detection using color distributions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(11):1281–1295, Nov. 2001.
8. G. Sharma. Color Imaging Handbook. Ed., CRC Press, 2003.
9. G. Sharma, W. Wu, and E. N. Dalal. The CIEDE2000 color-difference formula: Implementation notes, supplementary test data, and mathematical observations. To appear in Color Research and Application, April 2004.
10. P. E. Trahanias and A. N. Venetsanopoulos. Vector order-statistics operators as color edge detectors. IEEE Trans. Systems, Man and Cybernetics, B-26(1):135–143, Feb. 1996.
11. S. Di Zenzo. A note on the gradient of a multi-image. CVGIP, 33(1):116–125, January 1986.
A Fast Probabilistic Bidirectional Texture Function Model
Michal Haindl and Jiří Filip
Dept. of Pattern Recognition, Institute of Information Theory and Automation, Academy of Sciences of the Czech Republic, Prague, Czech Republic {haindl,filipj}@utia.cas.cz
Abstract. The bidirectional texture function (BTF) describes rough texture appearance variations due to varying illumination and viewing conditions. Such a function consists of thousands of measurements (images) per sample. The resulting BTF size precludes its direct rendering in graphical applications, and some compression of these huge BTF data spaces is obviously inevitable. In this paper we present a novel fast probabilistic model-based algorithm for realistic BTF modelling, allowing such an efficient compression with the possibility of direct implementation inside the graphics card. The analytical step of the algorithm starts with BTF space segmentation and range map estimation of the BTF surface, followed by the spectral and spatial factorisation of selected sub-space multispectral texture images. Single monospectral band-limited factors are independently modelled by their dedicated causal autoregressive models (CAR). During rendering, the corresponding sub-space images of arbitrary size are synthesised, and both multispectral and range information are combined in a bump mapping filter of the rendering hardware according to view and illumination positions. The presented model offers a huge BTF compression ratio unattainable by any alternative sampling-based BTF synthesis method. Simultaneously, this model can be used to reconstruct missing parts of the BTF measurement space.
1 Introduction
Physically correct visualisation of virtual models cannot be accomplished without nature-like colour textures covering virtual or augmented reality (VR) scene objects. These textures can be either smooth or rough (the latter also referred to as the bidirectional texture function - BTF [1]). Rough textures, which have rugged surfaces, do not obey the Lambert law, and their reflectance is illumination and view angle dependent. Both types of textures occurring in virtual scene models can be either digitised natural textures or textures synthesised from an appropriate mathematical model. The former, simplistic option suffers, among other problems, from extreme memory requirements for the storage of a large number of digitised cross-sectioned slices through different material samples. The sampling solution becomes unmanageable for rough textures, which require storing thousands of different illumination and view angle samples for every texture. Even a simple VR scene then requires storing terabytes of texture data, which is far beyond the limits of any
current real-time hardware. Several intelligent sampling methods [2], [3], applied either to smooth or to BTF data, were proposed with the aim of diminishing these huge memory requirements. All these methods are based on some sort of sampling of a small original texture, and the best of them produce very realistic synthetic textures. However, these methods require storing thousands of images for every combination of viewing and illumination angles of the original target texture sample, they often produce visible seams, some of them are computationally demanding, and they cannot generate textures unseen by these algorithms. Synthetic textures are far more flexible and extremely compressed (only a few tens of parameters have to be stored); they may be evaluated directly in procedural form and can be designed to meet certain constraints or properties, so that they can be used to fill an infinite texture space without visible discontinuities. Several smooth texture modelling approaches were published, e.g., [4], [5], [6], [7], and some survey articles are available [8]. The random field based models represent the high frequencies present in natural textures quite successfully; low frequencies are much more difficult for them. One possibility to overcome this drawback is to use a multiscale random field model. Unfortunately, autoregressive random fields, similarly to the majority of other Markovian types of random field models, are not invariant to multiple resolution decomposition (MRD), even for simple MRD like subsampling, and the lower-resolution images generally lose their autoregressive property (or wide-sense Markovianity property) and become ARMA random fields instead. Fortunately, we can avoid computationally demanding approximations of an ARMA multigrid random field by infinite order (i.e., high order in practice) autoregressive random fields, because there is no need to transfer information between single spatial factors; hence it is sufficient to analyse and synthesise each resolution component independently. The rest of the paper is organised as follows. The following section describes the probabilistic BTF synthesis algorithm, composed of BTF space segmentation, range-map estimation and multiscale smooth texture synthesis. We discuss the underlying CAR model parameter estimation and synthesis solutions. Results are reported in Section 3, followed by conclusions in the last section.
2 BTF Model
We propose a novel algorithm (Fig. 1) for efficient rough texture modelling which combines an estimated range map with synthetic multiscale smooth textures generated using a set of simultaneous causal autoregressive (CAR) random field models. The material's visual appearance during changes of viewing and illumination conditions is simulated using the bump-mapping technique. The idea of perturbing surface normals using a range-map was introduced by Blinn [9]. The overall appearance is guided by the corresponding underlying sub-space model. The obvious advantage of this solution is the possibility of using the hardware support for bump mapping available in contemporary visualisation hardware.
Fig. 1. The overall BTF algorithm scheme.
2.1 BTF Space Segmentation
A single probabilistic BTF model combined with the bump-mapping filter can potentially approximate the whole BTF measurement space. However, approximation errors for significantly different illumination and viewing angles can worsen the visual realism of certain virtual textured object faces. Thus we propose to trade some of the extreme compression ratio for visual quality and to use several dedicated probabilistic BTF subspace models. The off-line part of our algorithm (see Fig. 1) starts with the segmentation of the BTF space into several subspaces using the simple K-means algorithm, in the representation of 81 view × 81 illumination positions, using colour histogram data features. Such a representation is numerically more efficient than the alternative spatial texture representation because the sample resolution is usually larger than the number of angle discretisation steps. The eigenvalue analysis (PCA) leads us to the conclusion that the intrinsic BTF space dimensionality for most BTF texture samples is between ten and twenty; hence the first 10–20 eigenvalues contain 95% of the whole information. Several BTF space segmentations for different materials are depicted in Fig. 2.
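A minimal sketch of this off-line segmentation step, assuming per-image colour histograms as features and scikit-learn's K-means (the number of clusters and the bin count are illustrative, not the paper's values):

```python
import numpy as np
from sklearn.cluster import KMeans

def btf_subspace_index(btf_images, n_clusters=12, bins=16):
    """Cluster the 81 x 81 view/illumination BTF images into sub-spaces.

    btf_images: iterable of (H, W, 3) uint8 images, one per view/illumination pair.
    Returns the cluster label of every image (the sub-space index).
    """
    features = []
    for img in btf_images:
        # Per-channel colour histogram as the feature vector of one image.
        hist = [np.histogram(img[..., c], bins=bins, range=(0, 255), density=True)[0]
                for c in range(3)]
        features.append(np.concatenate(hist))
    features = np.asarray(features)
    return KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(features)
```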
2.2 Range Map Estimation
The overall roughness of a textured surface significantly influences the BTF texture appearance. Such a surface can be specified using its range map, which can be estimated by several existing approaches. The most accurate range map can be obtained by direct measurement of the observed surface using corresponding range cameras; however, this method requires special hardware and measurement methodology. Hence alternative approaches that estimate the range map from surface images are more appropriate. One of the most accurate approaches, used in this paper, is photometric stereo, which estimates the surface range map from at least three images obtained for different positions of the illumination source while the camera position is fixed. In the case of a full BTF measurement space, such mutually registered images are always available for free. Photometric stereo enables acquiring the normal and albedo fields from no fewer than three intensity images, assuming a Lambertian opaque surface. For details see [10].
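Photometric stereo under the Lambertian assumption reduces to a per-pixel least-squares problem; a minimal sketch (array layouts and names are choices made here) follows. A range map can then be integrated from the recovered normals, e.g. with the Frankot-Chellappa method [13].

```python
import numpy as np

def photometric_stereo(intensities, light_dirs):
    """Estimate surface normals and albedo from >= 3 images (Lambertian assumption).

    intensities: array (K, H, W), grey-level images taken under K light directions.
    light_dirs:  array (K, 3), unit illumination direction for each image.
    Solves I = L (albedo * n) per pixel in the least-squares sense.
    """
    k, h, w = intensities.shape
    L = np.asarray(light_dirs, dtype=float)            # (K, 3)
    I = intensities.reshape(k, -1).astype(float)       # (K, H*W)
    G, *_ = np.linalg.lstsq(L, I, rcond=None)          # (3, H*W), G = albedo * n
    albedo = np.linalg.norm(G, axis=0)
    normals = G / np.maximum(albedo, 1e-8)
    return normals.reshape(3, h, w), albedo.reshape(h, w)
```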
Fig. 2. Sub-space index images for four different materials: leather (12 clusters), fabric (13 clusters), wool (15 clusters) and lacquered wood (15 clusters). Fig. 3. Lacquered wood synthesised BTF rendered on the sphere.
2.3 Smooth Texture Model
Modelling general multispectral (e.g., colour) texture images requires three-dimensional models. If a 3D data space can be factorised, then these data can be modelled using a set of lower-dimensional 2D random field models; otherwise it is necessary to use some 3D random field model. Although a full 3D model allows unrestricted spatial-spectral correlation modelling, its main drawback is the large number of parameters to be estimated. The factorisation alternative is attractive because it allows using simpler 2D data models with fewer parameters. Spectral Decorrelation. A real data space can be decorrelated only approximately; hence the independent spectral component modelling approach suffers some loss of image information. However, this loss of spectral information is only visible in textures with many substantially different colours. Spectral factorisation using the Karhunen-Loeve expansion transforms the original centred data space, defined on the rectangular M × N finite lattice I, into a new data space with K-L coordinate axes. The new basis vectors are the eigenvectors of the second-order statistical moment matrix of the centred multispectral pixel vectors $\tilde{Y}_r$, where the multiindex $r = (r_1, r_2)$ has two components, the first being the row index and the second the column index. The projection of the random vector $\tilde{Y}_r$ onto the K-L coordinate system uses the transformation matrix
$T$, whose rows are the eigenvectors of the moment matrix. Components of the transformed vector are mutually uncorrelated, and if we assume that they are also Gaussian then they are independent; thus each transformed monospectral factor can be modelled independently of the remaining spectral factors. Spatial Factorisation. The input multispectral image is factorised into monospectral images, one per K-L axis. These components are further decomposed into a multi-resolution grid, and the data at each resolution are independently modelled by their dedicated CAR models. Each one generates a single spatial frequency band of the texture. The analysed texture is decomposed into multiple resolution factors using the Laplacian pyramid and the intermediary Gaussian
pyramid, which is a sequence of images in which each one is a low-pass down-sampled version of its predecessor. The Gaussian pyramid $\ddot{Y}_k$ for a reduction factor $r$ is
$\ddot{Y}_k = \downarrow_r (\ddot{Y}_{k-1} \otimes w), \qquad k = 1, 2, \ldots, \qquad (2)$
where $\ddot{Y}_0$ is the original monospectral image, $\downarrow_r$ denotes down-sampling with reduction factor $r$, and $\otimes$ is the convolution operation. The convolution mask, based on the weighting function (FIR generating kernel) $w$, is assumed to satisfy the separability, normalisation, symmetry and equal contribution constraints. The Laplacian pyramid $\dot{Y}_k$ contains band-pass components and provides a good approximation to the Laplacian of the Gaussian kernel. It can be constructed by differencing single Gaussian pyramid layers:
$\dot{Y}_k = \ddot{Y}_k - \uparrow_r (\ddot{Y}_{k+1}), \qquad (4)$
where $\uparrow_r$ is up-sampling with an expanding factor $r$.
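A small sketch of the pyramid analysis and its collapse (inversion) for one monospectral factor, with a reduction factor of 2 and a conventional 5 × 5 binomial kernel; the kernel and the nearest-neighbour up-sampling are illustrative choices, not necessarily those of the paper.

```python
import numpy as np
from scipy import ndimage

def laplacian_pyramid(band, levels=4):
    """Build a Laplacian pyramid of a monospectral band (reduction factor 2)."""
    kernel = np.outer([1, 4, 6, 4, 1], [1, 4, 6, 4, 1]) / 256.0  # separable FIR kernel
    gauss = [band.astype(float)]
    for _ in range(levels):
        smoothed = ndimage.convolve(gauss[-1], kernel, mode='reflect')
        gauss.append(smoothed[::2, ::2])                 # down-sample by 2
    lap = []
    for k in range(levels):
        up = np.kron(gauss[k + 1], np.ones((2, 2)))[:gauss[k].shape[0], :gauss[k].shape[1]]
        lap.append(gauss[k] - up)                        # band-pass layer
    lap.append(gauss[-1])                                # coarsest low-pass residual
    return lap

def collapse(lap):
    """Inverse operation: reconstruct the band from its Laplacian pyramid."""
    band = lap[-1]
    for layer in reversed(lap[:-1]):
        up = np.kron(band, np.ones((2, 2)))[:layer.shape[0], :layer.shape[1]]
        band = layer + up
    return band
```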
CAR Factor Model. Single orthogonal monospectral components (the spectral index is omitted in what follows) are thus decomposed into a multi-resolution grid, and the data at each resolution are independently modelled by their dedicated independent-Gaussian-noise-driven autoregressive random field model (CAR), as follows. The causal autoregressive random field (CAR) is a family of random variables with a joint probability density on the set of all possible realisations $Y$ of the M × N lattice I.
The 2D CAR model can be expressed as a stationary causal uncorrelated-noise-driven 2D autoregressive process:
$Y_r = \gamma X_r + e_r, \qquad (6)$
where $\gamma = [a_1, \ldots, a_\eta]$ is the parameter vector, $I_r^c$ is a causal neighbourhood of pixel $r$ with $\eta = |I_r^c|$, $e_r$ is white Gaussian noise with zero mean and a constant but unknown variance $\sigma^2$, and $X_r$ is the corresponding vector of the past realisations $Y_{r-s}$, $s \in I_r^c$.
Parameter Estimation. Parameter estimation of a CAR model using the maximum likelihood, least squares or Bayesian methods can be carried out analytically. The Bayesian parameter estimates of the causal AR model with the normal-gamma parameter prior, which maximise the posterior density, have a closed analytical form (7) built from accumulated data statistics and the prior matrices; the estimates (7) can also be evaluated recursively if necessary. Model Synthesis. The CAR model synthesis is very simple, and a causal CAR random field can be directly generated from the model equation (6). Single CAR models synthesise the spatial frequency bands of the texture. Each monospectral fine-resolution component is obtained from the pyramid collapse procedure (the inversion of (2),(4)). Finally, the resulting synthesised colour texture is obtained from the set of synthesised monospectral images using the inverse K-L transformation. If a single visualised scene simultaneously contains BTF texture view and illumination angle combinations which are modelled by different probabilistic models (i.e., models supported by different BTF subspaces) for the same material, all such required textures are easily synthesised simultaneously. Simultaneous synthesis allows avoiding difficult texture registration problems. During rendering, the four closest BTF view and illumination positions from the BTF sub-space index, computed during segmentation, are taken along with corresponding weights. The resulting image is an interpolation of these synthesised images. This image is finally combined with the range-map according to the illumination direction and sent to the polygon being processed.
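The CAR analysis and synthesis steps can be sketched as follows; the particular causal neighbourhood, the plain least-squares fit (rather than the Bayesian estimates (7)) and all names are simplifying assumptions for illustration.

```python
import numpy as np

def car_fit(y, neighbours=((0, -1), (-1, -1), (-1, 0), (-1, 1))):
    """Least-squares estimate of a causal AR (CAR) model for one monospectral factor."""
    h, w = y.shape
    rows, targets = [], []
    for r in range(1, h - 1):
        for c in range(1, w - 1):
            rows.append([y[r + dr, c + dc] for dr, dc in neighbours])
            targets.append(y[r, c])
    X, t = np.asarray(rows), np.asarray(targets)
    gamma, *_ = np.linalg.lstsq(X, t, rcond=None)       # parameter vector
    sigma2 = np.mean((t - X @ gamma) ** 2)              # driving-noise variance
    return gamma, sigma2

def car_synthesise(shape, gamma, sigma2,
                   neighbours=((0, -1), (-1, -1), (-1, 0), (-1, 1)), seed=0):
    """Generate a texture factor of arbitrary size directly from the CAR equation (6)."""
    rng = np.random.default_rng(seed)
    h, w = shape
    y = rng.normal(0.0, np.sqrt(sigma2), size=(h + 1, w + 2))  # padded causal border
    for r in range(1, h + 1):
        for c in range(1, w + 1):
            x = np.array([y[r + dr, c + dc] for dr, dc in neighbours])
            y[r, c] = gamma @ x + rng.normal(0.0, np.sqrt(sigma2))
    return y[1:, 1:w + 1]
```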
3 Results
We have tested the algorithm on BTF colour measurements, mostly from the University of Bonn, such as upholstery, lacquered wood, knitwear or leather textures. Each BTF material sample in the Bonn database is measured for 81 illumination and 81 viewing angles and has resolution 800 × 800. Fig. 4 demonstrates synthesised results for three different materials (fabric, wool and leather) rendered on cylinders consisting of a 64 × 64 polygon mesh. Fig. 3 illustrates the lacquered wood smooth BTF synthesis rendered on the sphere (without bump). Even if the analysed textures violate the CAR model stationarity assumption, the proposed algorithm demonstrates its ability to model such BTF textures with acceptable visual realism. We used only low-order CAR models for the experiments in Fig. 4, with fewer than 7 contextual neighbours. The figures above also demonstrate the colour quality of the model. Our synthesised images manifest a colour quality comparable with the much more complex 3D models [5],[7]. The overall BTF space in our extremely compressed parametric representation requires 300 KB of space (range-map, sub-space index, 2D CAR model parameters). The BTF measurements per sample for the Bonn data are 40 GB, hence
Fig. 4. Three distinct BTF materials - fabric, wool and leather are rendered on a cylinder using the proposed BTF model combined with the bump mapping filter.
we are able to reach a compression ratio of roughly 1 : 130 000. Moreover, the proposed algorithm is very fast, since the BTF synthesis of sub-space images from the 2D CAR model parameters takes about 3 seconds on average on an Athlon 1.9 GHz. In our case the bump mapping was computed on the CPU, but a fast implementation on the GPU using fragment programs can be exploited as well [11].
4 Summary and Conclusions
The test results of our algorithm on the available BTF data are encouraging. Some synthetic textures reproduce the given measured texture images so well that the natural and synthetic textures are almost visually indiscernible. The overwhelming majority of the original colour tones were reproduced realistically in spite of the restricted spectral modelling power of the model due to the spectral decorrelation error. The proposed method allows a huge compression ratio (unattainable by alternative intelligent sampling approaches) for transmitting or storing texture information while having low computational complexity. The method does not need any time-consuming numerical optimisation such as the usually employed Markov chain Monte Carlo methods. The CAR model synthesis is much faster than MRF model synthesis, even faster than the exceptionally fast Gaussian Markov random field model [12] implemented on a toroidal underlying lattice. The CAR model is better suited for real-time (e.g., using a graphics card processing unit) or web applications than most other published alternatives. The presented technique allows avoiding the difficult texture registration problem through the simple simultaneous synthesis of all necessary sub-space textures. The method also allows BTF data space restoration; in the extreme situation the whole BTF space can be approximated from a single measurement. The bump filter can be implemented using fragment programs on novel graphics cards to
speed up the rendering of our algorithm's results. The presented method is based on an estimated model, in contrast to the prevailing intelligent-sampling type of methods, and as such it can only approximate the realism of the original measurements. However, it offers an unbeatable data compression ratio (tens of parameters per BTF only), easy simulation of even non-existing (previously unmeasured) BTF textures and fast seamless synthesis of any texture size. Acknowledgements. This research was supported by the EC projects no. IST-2001-34744 RealReflect, FP6-507752 MUSCLE, grants No. A2075302, T400750407 of the Grant Agency of the Academy of Sciences CR and partially by the grant MŠMT No. ME567 MIXMODE. The authors wish to thank R. Klein of the University of Bonn for providing us with the BTF measurements.
References
1. Dana, K., van Ginneken, B., Nayar, S., Koenderink, J.: Reflectance and texture of real-world surfaces. ACM Transactions on Graphics 18 (1999) 1–34
2. Efros, A.A., Leung, T.K.: Texture synthesis by non-parametric sampling. In: Proc. Int. Conf. on Computer Vision (2). (1999) 1033–1038
3. Heeger, D., Bergen, J.: Pyramid based texture analysis/synthesis. In: Proc. SIGGRAPH 95, ACM (1995) 229–238
4. Haindl, M.: Multiresolution colour texture synthesis. In Dobrovodský, K., ed.: Proceedings of the 7th International Workshop on Robotics in Alpe-Adria-Danube Region, Bratislava, ASCO Art (1998) 297–302
5. Bennett, J., Khotanzad, A.: Multispectral random field models for synthesis and analysis of color images. IEEE Trans. on Pattern Analysis and Machine Intelligence 20 (1998) 327–332
6. Haindl, M.: A multiresolution causal colour texture model. In Ferri, F.J., Inesta, J.M., Amin, A., Pudil, P., eds.: Advances in Pattern Recognition, Lecture Notes in Computer Science 1876. Springer-Verlag, Berlin (2000) 114–122
7. Haindl, M.: A multiscale colour texture model. In Kasturi, R., Laurendeau, D., Suen, C., eds.: Proceedings of the 16th International Conference on Pattern Recognition, Los Alamitos, IEEE Computer Society (2002) 255–258
8. Haindl, M.: Texture synthesis. CWI Quarterly 4 (1991) 305–331
9. Blinn, J.: Models of light reflection for computer synthesized pictures. In: Computer Graphics Proceedings, Annual Conference Series, ACM SIGGRAPH (1977) 192–198
10. Woodham, R.: Analysing images of curved surfaces. Artificial Intelligence 17 (1981) 117–140
11. Welsch, T.: Parallax mapping with offset limiting: A per-pixel approximation of uneven surfaces. Technical Report Revision 0.3, http://www.infiscape.com/rd.html (2004)
12. Haindl, M., Filip, J.: Fast BTF texture modelling. In Chandler, M., ed.: Texture 2003, IEEE Computer Society (2003)
13. Frankot, R.T., Chellappa, R.: A method for enforcing integrability in shape from shading algorithms. IEEE Trans. on Pattern Analysis and Machine Intelligence 10 (1988) 439–451
Model-Based Texture Segmentation Michal Haindl and Stanislav Mikeš Institute of Information Theory and Automation Academy of Sciences CR, 182 08 Prague, Czech Republic {haindl,xaos}@utia.cas.cz
Abstract. An efficient and robust type of unsupervised multispectral texture segmentation method is presented. Single decorrelated monospectral texture factors are assumed to be represented by a set of local Gaussian Markov random field (GMRF) models evaluated for each pixel centered image window and for each spectral band. The segmentation algorithm based on the underlying Gaussian mixture (GM) model operates in the decorrelated GMRF parametric space. The algorithm starts with an oversegmented initial estimation which is adaptively modified until the optimal number of homogeneous texture segments is reached.
1 Introduction
Segmentation is a fundamental process affecting the overall performance of an automated image analysis system. Image regions homogeneous with respect to some, usually textural, measure, which result from a segmentation algorithm, are analysed in subsequent interpretation steps. Texture-based image segmentation has been an area of intense research activity in recent years, resulting in a large number of published algorithms. These methods are usually categorised [1] as region-based, boundary-based, or as a hybrid of the two. Different published methods are difficult to compare because of the lack of a comprehensive analysis together with accessible experimental data; however, the available results indicate that the texture segmentation problem is still far from being solved. Spatial interaction models, and especially Markov random field based models, have become increasingly popular for texture representation [2], [1], [3], [4], etc. Several researchers have also addressed the difficult problem of unsupervised segmentation using these models; see for example [5], [6], [7] or [8].
2 Texture Representation
Adequate representation of general multispectral textures requires three-dimensional models. Although a full 3D model allows an unrestricted spatial-spectral correlation description, its main drawback is the large number of parameters to be estimated and, in the case of Markov random field based models (MRF), also the necessity to estimate all these parameters simultaneously. The factorisation alternative used in this paper is attractive because it allows using a combination of
several simpler 2D data models with fewer parameters per model. A naturally measured texture data space can be decorrelated only approximately; thus the independent spectral component representation suffers some loss of image information. However, segmentation is a less demanding application than texture synthesis: it is sufficient if such a representation maintains the discriminative power of the full model even if its visual modelling strength is slightly compromised.
2.1 Spectral Factorization
Spectral factorization using the Karhunen-Loeve expansion transforms the original centered data space, defined on the rectangular M × N finite lattice I, into a new data space with K-L coordinate axes. These new basis vectors are the eigenvectors of the second-order statistical moment matrix
$\Phi = E\{ \tilde{Y}_r \tilde{Y}_r^T \}, \qquad (1)$
where $\tilde{Y}_r$ is the centered multispectral pixel vector and the multiindex $r = (r_1, r_2)$ has two components, the first being the row index and the second the column index. The projection of the centered random vector $\tilde{Y}_r$ onto the K-L coordinate system uses the transformation matrix $T$, whose rows are the eigenvectors of the matrix $\Phi$:
$\bar{Y}_r = T \tilde{Y}_r. \qquad (2)$
Components of the transformed vector (2) are mutually uncorrelated. If we further assume Gaussian vectors, then they are also independent, and the single monospectral random fields can be modelled independently.
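A compact sketch of this spectral decorrelation (the K-L / PCA transform of the spectral planes); the reshaping convention is an implementation choice made here.

```python
import numpy as np

def kl_decorrelate(image):
    """Karhunen-Loeve (PCA) decorrelation of the spectral planes of a colour texture.

    image: float array (H, W, C).  Returns the decorrelated planes and the
    transformation matrix T whose rows are eigenvectors of the moment matrix.
    """
    h, w, c = image.shape
    pixels = image.reshape(-1, c)
    centred = pixels - pixels.mean(axis=0)
    phi = centred.T @ centred / centred.shape[0]        # second-order moment matrix
    eigvals, eigvecs = np.linalg.eigh(phi)
    T = eigvecs[:, ::-1].T                              # rows = eigenvectors, largest first
    decorrelated = centred @ T.T
    return decorrelated.reshape(h, w, c), T
```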
2.2 Texture Model
We assume that single monospectral texture factors can be modelled using a Gaussian Markov random field model (GMRF). This model is obtained if the local conditional density of the MRF model is Gaussian:
$p(Y_r \mid Y_s, s \in I_r) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{ -\frac{(Y_r - \mu_r)^2}{2\sigma^2} \right\}, \qquad (3)$
where the mean value is
$\mu_r = E\{ Y_r \mid Y_s, s \in I_r \} = \sum_{s \in I_r} a_s Y_{r-s},$
and $\sigma^2$ and $a_s$, $s \in I_r$, are unknown parameters.
The 2D GMRF model can be expressed as a stationary non-causal correlated-noise-driven 2D autoregressive process:
$Y_r = \sum_{s \in I_r} a_s Y_{r-s} + e_r, \qquad (4)$
where the noise $e_r$ is a random variable with zero mean. The noise variables are mutually correlated, with $E\{e_r e_{r-s}\} = \sigma^2$ if $s = (0,0)$, $-\sigma^2 a_s$ if $s \in I_r$, and $0$ otherwise. The correlation functions have the symmetry property; hence the neighbourhood support set and its associated coefficients have to be symmetric, i.e., $s \in I_r \Rightarrow -s \in I_r$ and $a_s = a_{-s}$. The selection of an appropriate GMRF model support is important to obtain good results in modelling a given random field. If the contextual neighbourhood is too small, it cannot capture all details of the random field. Inclusion of unnecessary neighbours, on the other hand, adds to the computational burden and can potentially degrade the performance of the model as an additional source of noise. We use a hierarchical neighbourhood system, e.g., the first-order neighbourhood consists of the four nearest neighbours, etc. An optimal neighbourhood is detected using the correlation method [9], favouring neighbour locations corresponding to large correlations over those with small correlations. Parameter estimation of an MRF model is complicated by the difficulty associated with computing the normalisation constant. Fortunately, the GMRF model is an exception where the normalisation constant is easily obtainable; however, either the Bayesian or the ML estimate requires iterative minimisation of a nonlinear function. Therefore we use the pseudo-likelihood estimator, which is computationally simple although not efficient. The pseudo-likelihood estimate of the parameters is evaluated for a sublattice centred on the index $r$, and has the form
$\hat{\gamma}_r = \left( \sum_{s} X_s X_s^T \right)^{-1} \sum_{s} X_s Y_s, \qquad (5)$
where the sums run over the sublattice and $X_s$ is the corresponding vector of neighbour values.
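A sketch of this pseudo-likelihood (conditional least-squares) estimate for one pixel-centred window; the symmetric neighbour handling shown here is one common convention and not necessarily the exact estimator of the paper.

```python
import numpy as np

def gmrf_pseudo_likelihood(window, support=((0, 1), (1, 0), (1, 1), (1, -1))):
    """Pseudo-likelihood (conditional least-squares) estimate of GMRF parameters.

    window: one monospectral pixel-centred image window (2D float array).
    support: half of the symmetric neighbour set; each offset s implies -s as well.
    Returns the coefficient vector a_s and the noise variance estimate.
    """
    h, w = window.shape
    m = max(max(abs(dr), abs(dc)) for dr, dc in support)
    rows, targets = [], []
    for r in range(m, h - m):
        for c in range(m, w - m):
            # Symmetric coefficients: each feature sums the pair Y_{r+s} + Y_{r-s}.
            rows.append([window[r + dr, c + dc] + window[r - dr, c - dc]
                         for dr, dc in support])
            targets.append(window[r, c])
    X, t = np.asarray(rows), np.asarray(targets)
    a, *_ = np.linalg.lstsq(X, t, rcond=None)
    sigma2 = np.mean((t - X @ a) ** 2)
    return a, sigma2
```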
3 Mixture Model Based Segmentation
Multi-spectral texture segmentation is done by clustering in the GMRF parameter space defined on the lattice I, where the feature vector at location $r$ is
$\theta_r = [\hat{\gamma}_r^{1}, \ldots, \hat{\gamma}_r^{d}, \nu_r^{1}, \ldots, \nu_r^{d}]^T,$
$\hat{\gamma}_r^{i}$ being the parameter vector (5) computed for the i-th transformed spectral band at the lattice location $r$, and $\nu_r^{i}$ the average local spectral value. We assume that this parametric space can be represented using a Gaussian mixture model with diagonal covariance matrices. Hence the GMRF parametric space is first decorrelated using the Karhunen-Loeve transformation (analogously to (1)-(2)). The Gaussian mixture model for the GMRF parametric representation is as follows:
$p(\theta_r) = \sum_{i=1}^{K} p_i \, p(\theta_r \mid \mu_i, \Sigma_i), \qquad (6)$
$p(\theta_r \mid \mu_i, \Sigma_i) = \frac{1}{\sqrt{(2\pi)^{d_\theta} |\Sigma_i|}} \exp\left\{ -\tfrac12 (\theta_r - \mu_i)^T \Sigma_i^{-1} (\theta_r - \mu_i) \right\}, \qquad (7)$
with diagonal covariance matrices $\Sigma_i$.
The mixture equations (6),(7) are solved using a modified EM algorithm. The algorithm is initialised using statistics estimated from the corresponding rectangular subimages obtained by a regular division of the input texture mosaic. An alternative initialisation can be a random choice of these statistics. For each possible couple of rectangles, the Kullback-Leibler divergence
$D(p_i \,\|\, p_j) = \int p(\theta \mid \mu_i, \Sigma_i) \, \log \frac{p(\theta \mid \mu_i, \Sigma_i)}{p(\theta \mid \mu_j, \Sigma_j)} \, d\theta \qquad (8)$
is evaluated, and the most similar rectangles, i.e., the couple with the smallest divergence, are merged together in each step. This initialisation results in subimages and recomputed statistics, where K is the optimal number of textured segments to be found by the algorithm. After initialisation, two steps of the EM algorithm are repeated:
E: the posterior component probabilities are computed for every feature vector $\theta_r$; M: the component weights $p_i$, means $\mu_i$ and diagonal covariances $\Sigma_i$ are re-estimated from these posteriors.
The components with smaller weights than a given threshold are eliminated. For every pair of components we estimate their Kullback Leibler divergence (8). From the most similar couple, the component with the weight smaller
than the threshold is merged into its stronger partner, and all statistics are updated using the EM algorithm. The algorithm stops when either the likelihood function has a negligible increase or the maximum number of iterations is reached. The parametric vectors representing the texture mosaic pixels are assigned to the clusters according to the highest component probabilities, i.e., $\theta_r$ is assigned to cluster $\omega_j$ if $j = \arg\max_i \, p_i \, p(\theta_r \mid \mu_i, \Sigma_i)$.
The areas of single-cluster blobs are evaluated in the post-processing thematic map filtration step. Thematic map blobs with an area smaller than a given threshold are attached to their neighbour with the highest similarity value. If there is no similar neighbour, the blob is eliminated. After all blobs are processed, the remaining blobs are expanded.
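For reference, the Kullback-Leibler divergence (8) between two mixture components with diagonal covariances, and the search for the most similar pair, can be sketched as follows (the symmetrised sum used for the pair search is an assumption made here):

```python
import numpy as np

def kl_diag_gaussians(mu_p, var_p, mu_q, var_q):
    """KL divergence D(p || q) between two Gaussians with diagonal covariances."""
    mu_p, var_p = np.asarray(mu_p, float), np.asarray(var_p, float)
    mu_q, var_q = np.asarray(mu_q, float), np.asarray(var_q, float)
    return 0.5 * np.sum(np.log(var_q / var_p) + (var_p + (mu_p - mu_q) ** 2) / var_q - 1.0)

def most_similar_pair(mus, variances):
    """Index pair of the two components with the smallest symmetrised divergence."""
    k = len(mus)
    best, best_d = None, np.inf
    for i in range(k):
        for j in range(i + 1, k):
            d = (kl_diag_gaussians(mus[i], variances[i], mus[j], variances[j]) +
                 kl_diag_gaussians(mus[j], variances[j], mus[i], variances[i]))
            if d < best_d:
                best, best_d = (i, j), d
    return best, best_d
```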
4 Experimental Results
The algorithm was tested on natural colour texture mosaics. Fig. 1 shows four 256 × 256 experimental texture mosaics created from five natural colour textures. The last column demonstrates comparative results from the Blobworld algorithm [10]. All textures used in Fig. 1 are from the MIT Media Lab VisTex [11] collection, but we obtained similar results also on our own large texture database. We have chosen natural textures rather than synthesised ones (for example, textures generated using Markov random field models) because they are expected to be more difficult for the underlying segmentation model. The detected interclass borders can be checked in Fig. 1 (third column), where they are inserted into the corresponding input mosaics. The second column demonstrates the robust behaviour
Fig. 1. Selected experimental texture mosaics (A,B,F,G - downward), our segmentation results, segmentation maps inserted into original data, and Blobworld segmentation results (rightmost column), respectively.
of our algorithm, while mosaic E in Tab. 1 presents an infrequent algorithm failure producing an oversegmented thematic map. Such failures can be corrected by a more elaborate postprocessing step. The Blobworld algorithm [10] performed consistently worse on these data, as can be seen in the last column of Fig. 1: some areas are undersegmented while other parts of the mosaics are oversegmented. The resulting segmentations are promising; however, comparison with other algorithms is difficult because of the lack of sound experimental evaluation results in the field of texture segmentation algorithms. The Berkeley segmentation dataset and benchmark proposed in [12] is not appropriate for texture mosaics because it is based on precise region border localisation. The comparison table Tab. 1 shows the segmentation performance of the algorithm for single natural textures using the performance metrics of [13] (correct: > 70% of GT (ground truth) region
pixels are correctly assigned, oversegmentation > 70% GT pixels are assigned to a union of regions, undersegmentation > 70% pixels from a classified region belong to a union of GT regions). The overall probability of correct segmentation for this example is 96.5%. This result can be further improved by an appropriate postprocessing using for example the minimum area prior information.
5 Conclusions
We proposed a novel, efficient method for texture segmentation based on the underlying GMRF and GM texture models. Although the algorithm uses a Markov random field based model, it is reasonably fast because it uses an efficient MPL parameter estimation of the model and is therefore much faster than the usual Markov chain Monte Carlo estimation approach. A usual handicap of segmentation methods is the large number of application-dependent parameters that have to be experimentally estimated. Some methods need nearly a dozen adjustable parameters. Our method, on the other hand, requires only a contextual neighbourhood selection and two additional thresholds. The algorithm performance is demonstrated on the test natural texture mosaics and compares favourably with the alternative Blobworld algorithm; however, more extensive testing is necessary. These preliminary test results of the algorithm are encouraging, and we will proceed with more elaborate postprocessing and some alternative texture representation models, such as an alternative 3D Gaussian Markov random field model with a much larger set of parameters. Acknowledgements. This research was supported by the EC projects no. IST-2001-34744 RealReflect, FP6-507752 MUSCLE, and partially by the grants No. A2075302, T400750407 of the Grant Agency of the Academy of Sciences CR.
References
1. Reed, T.R., du Buf, J.M.H.: A review of recent texture segmentation and feature extraction techniques. CVGIP–Image Understanding 57 (1993) 359–372
2. Kashyap, R.: Image models. In Young, T.Y., Fu, K.S., eds.: Handbook of Pattern Recognition and Image Processing. Academic Press, New York (1986)
3. Haindl, M.: Texture synthesis. CWI Quarterly 4 (1991) 305–331
4. Mao, J., Jain, A.K.: Texture classification and segmentation using multiresolution simultaneous autoregressive models. Pattern Recognition 25 (1992) 173–188
5. Panjwani, D.K., Healey, G.: Markov random field models for unsupervised segmentation of textured color images. IEEE Transactions on Pattern Analysis and Machine Intelligence 17 (1995) 939–954
6. Manjunath, B.S., Chellappa, R.: Unsupervised texture segmentation using Markov random field models. IEEE Transactions on Pattern Analysis and Machine Intelligence 13 (1991) 478–482
7. Andrey, P., Tarroux, P.: Unsupervised segmentation of Markov random field modeled textured images using selectionist relaxation. IEEE Transactions on Pattern Analysis and Machine Intelligence 20 (1998) 252–262
8. Haindl, M.: Texture segmentation using recursive Markov random field parameter estimation. In Bjarne, K., Peter, J., eds.: Proceedings of the 11th Scandinavian Conference on Image Analysis, Lyngby, Denmark, Pattern Recognition Society of Denmark (1999) 771–776
9. Haindl, M.: Prototype implementation of the texture analysis objects. Technical Report 1939, Praha, Czech Republic (1997)
10. Carson, C., Thomas, M., Belongie, S., Hellerstein, J.M., Malik, J.: Blobworld: A system for region-based image indexing and retrieval. In: Third International Conference on Visual Information Systems, Springer (1999)
11. Vision texture (VisTex) database. Technical report, Vision and Modeling Group, MIT Media Laboratory (http://www-white.media.mit.edu/vismod/)
12. Martin, D., Fowlkes, C., Tal, D., Malik, J.: A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In: Proc. 8th Int'l Conf. Computer Vision. Volume 2. (2001) 416–423
13. Hoover, A., Jean-Baptiste, G., Jiang, X., Flynn, P.J., Bunke, H., Goldgof, D.B., Bowyer, K., Eggert, D.W., Fitzgibbon, A., Fisher, R.B.: An experimental comparison of range image segmentation algorithms. IEEE Transactions on Pattern Analysis and Machine Intelligence 18 (1996) 673–689
14. Cheng, H., Jiang, X., Sun, Y., Wang, J.: Color image segmentation: advances and prospects. Pattern Recognition 34 (2001) 2259–2281
15. Fu, K., Mui, J.: A survey on image segmentation. Pattern Recognition 13 (1981) 3–16
16. Gimel'farb, G.L.: Image Textures and Gibbs Random Fields. Volume 16 of Computational Imaging and Vision. Kluwer Academic Publishers (1999)
17. Kato, Z., Pong, T.C., Qiang, S.: Multicue MRF image segmentation: Combining texture and color features. In: Proc. International Conference on Pattern Recognition, IEEE (2002)
18. Khotanzad, A., Chen, J.Y.: Unsupervised segmentation of textured images by edge detection in multidimensional features. IEEE Transactions on Pattern Analysis and Machine Intelligence PAMI-11 (1989) 414–421
19. Meilă, M., Heckerman, D.: An experimental comparison of model-based clustering methods. Mach. Learn. 42 (2001) 9–29
20. Pal, N.R., Pal, S.K.: A review on image segmentation techniques. Pattern Recognition 26 (1993) 1277–1294
21. Shi, J., Malik, J.: Normalized cuts and image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 22 (2000) 888–905
A New Gabor Filter Based Kernel for Texture Classification with SVM Mahdi Sabri and Paul Fieguth Department of Systems Design Engineering, University of Waterloo Waterloo, N2L 3G1, Canada {msabri,pfieguth}@engmail.uwaterloo.ca
Abstract. The performance of Support Vector Machines (SVMs) is highly dependent on the choice of a kernel function suited to the problem at hand. In particular, the kernel implicitly performs the feature selection, which is the most important stage in any texture classification algorithm. In this work a new Gabor filter based kernel for texture classification with SVMs is proposed. The proposed kernel function is based on a Gabor filter decomposition, exploiting linear predictive coding (LPC) in each subband, together with a filter selection method to choose the best filters. The proposed texture classification method is evaluated using several texture samples and compared with recently published methods. The comprehensive evaluation of the proposed method shows a significant improvement in classification error rate. Keywords: Texture Classification, Support Vector Machine, Linear Predictive Coding, Gabor Filters, Segmentation.
1 Introduction
Texture analysis has been an active research field due to its key role in a wide range of applications, such as industrial object recognition [1], classification of ultrasonic liver images [2] or the detection of microcalcifications in digitized mammography [3]. Texture classification algorithms generally include two crucial steps: feature extraction and classification. In the feature extraction stage, a set of features is sought that can be computed efficiently and that embodies as much discriminative information as possible. The features are then used to classify the textures. A variety of classifiers have been used, and we propose to use support vector machines (SVMs), which have been shown to outperform other classifiers [4]. The superiority of SVMs originates from their ability to generalize in high dimensional spaces, focusing on the training examples that are most difficult to classify. SVMs can be effective in texture classification even without using any external features [5]. In fact, in the SVM, feature extraction is implicitly performed by
Fig. 1. A linearly nonseparable problem (Left) is converted to a linearly separable problem (Right) using a non-linear transform.
a kernel, which is defined as the dot product of two mapped patterns, and the proper selection of this kernel can significantly affect the overall performance of the algorithm. The main focus of this paper is to propose a new kernel function and to investigate the effectiveness of external features. Specifically, we propose to use linear predictive coding (LPC) [6] in the subbands of a Gabor filter bank to extract efficient sets of features. While most filter bank based methods require sophisticated feature selection methods to reduce the dimensionality of the feature space, our method takes advantage of high dimensionality due to the intrinsic ability of SVMs to generalize in high dimensional spaces. The rest of this paper is organized as follows. In Section 2 SVMs are reviewed, in Section 3 the proposed kernel and the filter selection algorithm are presented, and Section 4 is dedicated to experimental results.
2 SVM Review
The principle of SVMs is to construct a hyperplane as the decision surface in such a way that the margin of separation between training samples of different classes is maximized. Since a basic linear SVM scheme is not applicable to practical cases (which are not linearly separable), non-linear SVMs are widely used, in which a nonseparable pattern becomes separable with a high probability if projected into a nonlinear feature space of high dimensionality. Given x from the input space, let $\varphi(\mathbf{x})$ denote the non-linear features. Then, a linear decision surface in the non-linear space is:
$f(\mathbf{x}) = \mathbf{w}^T \varphi(\mathbf{x}) + b; \qquad (1)$
Given the two-class training samples $\{(x_i, y_i)\}_{i=1}^{N}$, $y_i \in \{-1, +1\}$, the weight coefficients w can be found by solving an optimization problem [7]:

$\min_{w,\,b,\,\xi}\ \tfrac{1}{2}\|w\|^{2} + C\sum_{i=1}^{N}\xi_i \quad \text{s.t.}\quad y_i\bigl(w^{T}\phi(x_i) + b\bigr) \ge 1 - \xi_i,\ \ \xi_i \ge 0, \qquad (2)$
where C is a regularization parameter selected by the user, and the nonnegative variables $\xi_i$ are slack variables. In particular, the Lagrange multipliers $\alpha_i$ are solved for in a dual form:

$\max_{\alpha}\ \sum_{i}\alpha_i - \tfrac{1}{2}\sum_{i}\sum_{j}\alpha_i\alpha_j y_i y_j\,\phi(x_i)^{T}\phi(x_j) \quad \text{s.t.}\quad 0 \le \alpha_i \le C,\ \ \sum_{i}\alpha_i y_i = 0, \qquad (3)$
which leads to the non-linear SVM classifier:

$f(x) = \operatorname{sign}\Bigl(\sum_{i}\alpha_i y_i K(x_i, x) + b\Bigr), \qquad (4)$
where the kernel function $K(\cdot,\cdot)$ is:

$K(x, x') = \phi(x)^{T}\phi(x'). \qquad (5)$
As can be seen in (3) and (4), the nonlinear mapping $\phi(\cdot)$ never appears explicitly in either the dual form or the resulting decision function. Thus it is only necessary to define $K(\cdot,\cdot)$, which implicitly defines $\phi(\cdot)$. Our proposed kernel is presented in the next section.
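As an aside (not part of the original paper), the practical consequence of (3)-(5) is that an SVM implementation only needs kernel values, never the mapping itself. A minimal sketch, assuming scikit-learn is available and using a generic Gaussian kernel on random stand-in feature vectors:

```python
import numpy as np
from sklearn.svm import SVC

def gaussian_gram(X, Y, gamma=0.5):
    """Gram matrix K[i, j] = exp(-gamma * ||X[i] - Y[j]||^2)."""
    sq = (np.sum(X ** 2, axis=1)[:, None]
          + np.sum(Y ** 2, axis=1)[None, :] - 2.0 * X @ Y.T)
    return np.exp(-gamma * np.maximum(sq, 0.0))

# The SVM only ever evaluates the kernel; the mapping phi is never needed.
rng = np.random.default_rng(0)
X_train = rng.standard_normal((40, 16))       # stand-in feature vectors
y_train = np.repeat([0, 1], 20)               # two texture classes
clf = SVC(C=10.0, kernel=gaussian_gram)       # any Mercer kernel callable works
clf.fit(X_train, y_train)
print(clf.predict(X_train[:5]))
```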
3 LPC Kernel for Texture Classification
The performance of an SVM classifier is strictly dependent on the choice of a kernel $K(\cdot,\cdot)$ suited to the problem at hand. In this section we introduce a new kernel based on LPC and Gabor filters.
3.1 Linear Predictive Coding
Linear predictive coding (LPC) is a popular and effective technique in signal processing [6], in which a given signal $s(n)$ is approximated as a p-th order autoregressive process:

$s(n) = \sum_{k=1}^{p} a_k\, s(n-k) + G\,u(n), \qquad (6)$
where $u(n)$ is a hypothetical input term and $G$ a gain. The linear prediction model indicates that the signal can be estimated by an all-pole system of order $p$ with the scaled input $G\,u(n)$.
Fig. 2. Discrimination ability of LPC for two typical textures. Dotted curves show the error margin within +/- standard deviation (std)
Proper selection of the LPC order leads to an efficient representation of the signal with reasonable discriminative power. Fig. 2 shows the discrimination ability of LPC. In this figure two typical textures A and B are considered. In Fig. 2a the LPC model of A is used to estimate several texture samples from A and B; in Fig. 2b the LPC model of texture B is used for the same experiment. The average estimation errors show the strong discrimination ability of the LPC model. In this paper we propose to use LPC coefficients as features for texture samples in the subbands of a Gabor filter bank.
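As an illustration of this feature extraction step (our sketch, not code from the paper), the LPC coefficients of a 1-D signal can be obtained by solving the autocorrelation normal equations; the signal below is a synthetic stand-in for a scan of one Gabor subband:

```python
import numpy as np
from scipy.linalg import solve_toeplitz

def lpc_coefficients(signal, p):
    """Return the p LPC coefficients a_1..a_p of the model
    s(n) ~ sum_k a_k * s(n-k), via the autocorrelation method."""
    s = np.asarray(signal, dtype=float)
    s = s - s.mean()
    # Biased autocorrelation estimates r(0)..r(p)
    r = np.array([np.dot(s[:len(s) - k], s[k:]) for k in range(p + 1)]) / len(s)
    # Toeplitz normal equations R a = [r(1)..r(p)], with R built from r(0)..r(p-1)
    return solve_toeplitz(r[:p], r[1:p + 1])

# Example: LPC of a noisy sinusoid (a stand-in for one subband scan line)
n = np.arange(512)
x = np.sin(0.2 * n) + 0.1 * np.random.randn(512)
print(lpc_coefficients(x, p=8))
```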
3.2 Gabor Filter
Filter banks have the ability to decompose an image into relevant texture features for the purpose of classification. Multi-channel filtering is motivated by its ability to mimic the sensitivity of the human visual system (HVS) [8] to orientation and spatial frequency. This has led to an HVS model consisting of independent detectors, each preceded by a relatively narrow-band filter tuned to a different frequency. Gabor filters are therefore a natural choice, due to their ability to be tuned to various orientations and spatial frequencies. In the spatial domain a Gabor function is a Gaussian modulated by a complex exponential:
In this study twenty filters are constructed using five spatial radial frequencies and four orientations, as recommended in [9].
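A sketch of such a filter bank is given below (our illustration; the five radial frequencies recommended in [9] are not reproduced here, so the values used are placeholders, and the one-octave bandwidth rule linking sigma to frequency is an assumption):

```python
import numpy as np

def gabor_filter(freq, theta, size=31):
    """Complex Gabor kernel: Gaussian envelope modulated by a complex
    exponential of radial frequency `freq` (cycles/pixel) at angle `theta`.
    The envelope width follows an assumed one-octave bandwidth rule."""
    sigma = 0.56 / freq                    # assumed sigma-frequency relation
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    envelope = np.exp(-(x ** 2 + y ** 2) / (2.0 * sigma ** 2))
    carrier = np.exp(2j * np.pi * freq * (x * np.cos(theta) + y * np.sin(theta)))
    return envelope * carrier

# Twenty filters: five radial frequencies (placeholders, not the values of [9])
# times four orientations.
frequencies = [0.05, 0.1, 0.2, 0.3, 0.4]
orientations = [0.0, np.pi / 4, np.pi / 2, 3 * np.pi / 4]
bank = [gabor_filter(f, t) for f in frequencies for t in orientations]
print(len(bank))   # 20
```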
3.3 Proposed SVM Kernel
Given an L × L image window x and a bank of K filters, we obtain K subband images. The LPC coefficients computed in each subband are collected into a vector; the LPC order ($p$ in (6)) was experimentally set to L. Motivated by the SVM kernel exploited in [10] for signal classification, we propose the following kernel:
which complies with Mercer's theorem [4]. A filter selection algorithm is used to pick the best filters among the K existing filters. The notation emphasizes the normalization of the LPC values.
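The exact kernel expression and normalization of the original equations are not reproduced here. The sketch below only illustrates the pipeline described in the text - filtering an L × L window with each selected Gabor filter and computing a normalized LPC vector per subband - and then, purely as a stand-in, applies a generic Gaussian kernel to the stacked features; `lpc_coefficients` refers to the illustrative helper sketched earlier, and `filters` can be a Gabor bank such as the one above:

```python
import numpy as np
from scipy.signal import fftconvolve

def window_features(window, filters, p):
    """Stacked, normalized per-subband LPC features for an L x L window
    (illustrative only; relies on the lpc_coefficients helper above)."""
    feats = []
    for g in filters:
        subband = np.real(fftconvolve(window, g, mode='same'))
        coeffs = lpc_coefficients(subband.ravel(), p)   # LPC of the subband
        norm = np.linalg.norm(coeffs)
        feats.append(coeffs / norm if norm > 0 else coeffs)
    return np.concatenate(feats)

def stand_in_kernel(F1, F2, gamma=1.0):
    """Generic Gaussian kernel on the stacked LPC features (a stand-in only;
    the proposed kernel has its own expression in the original paper)."""
    sq = (np.sum(F1 ** 2, axis=1)[:, None]
          + np.sum(F2 ** 2, axis=1)[None, :] - 2.0 * F1 @ F2.T)
    return np.exp(-gamma * np.maximum(sq, 0.0))
```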
3.4 Filter Selection
In a filter bank some filters are more effective than others in discriminating the features of a given set of textures. To address this issue, we propose a method of filter selection to optimize classifier performance. To achieve this goal we divide the training samples into two disjoint subsets, a training subset (T) and a validation subset (V), as in cross-validation [11]. Our filter selection algorithm is as follows: Step 0) initialize the set of selected filters as empty; Step 1) for each filter in B, train the classifier over T and find the classifier gain over V; Steps 2-3) move the best-performing filter from B to the selected set and record its gain; Step 4) repeat steps 1 to 3 while the gain keeps improving.
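Read as a greedy forward selection (our interpretation of the partially recovered steps, not the authors' exact procedure), the algorithm can be sketched as follows; `evaluate` stands for training the classifier on T with a candidate filter set and measuring its gain on V:

```python
def select_filters(bank, evaluate, max_filters=None):
    """Greedy forward selection of filters using a validation gain.

    bank     : list of candidate filters B
    evaluate : callable(selected_filters) -> gain of the classifier trained
               on T and evaluated on V (higher is better)
    """
    remaining = list(bank)
    selected = []
    best_gain = float('-inf')
    while remaining and (max_filters is None or len(selected) < max_filters):
        # Step 1: evaluate each remaining filter on top of the current selection.
        gains = [evaluate(selected + [f]) for f in remaining]
        idx = max(range(len(gains)), key=gains.__getitem__)
        # Steps 2-4: keep the best filter only while the gain keeps improving.
        if gains[idx] <= best_gain:
            break
        best_gain = gains[idx]
        selected.append(remaining.pop(idx))
    return selected
```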
4 Comparison with Existing Methods
To verify the effectiveness of the proposed method (LG-SVM), experiments were performed on classification and segmentation of several test images. The test images were drawn from two commonly used texture sources: the Brodatz album [12] and the MIT vision texture (VisTex) database [13]. All textures are gray-scale images with 256 levels. The classifiers were trained on randomly selected portions of subimages that are not included in the test images. Gray scales were linearly normalized into [-1, 1] prior to training and testing. The classification results are compared with the original SVM [5] as well as with logical operators [14], the wavelet transform [15], filter banks [16], and the spectral histogram [17]. The segmentation result is compared with the optimal Gabor filter method [9].
Fig. 3. Texture images used in experiments (D# is the numbering scheme in the Brodatz album) (a)D4, D84 (b)D5, D92 (c) D4, D9, D19, and D57 (d) Fabric.0007, Fabric.0009, Leaves.0003, Misc0002, and Sand.0000 (from [13] )
Images in Fig. 3 are 256 × 256. Classifiers were trained with 1000 patterns from each texture, which corresponds to about 1.7 percent of the total available input patterns. The results are compared at window sizes of 9 × 9, 13 × 13, 17 × 17, and 21 × 21. The original SVM shows the optimal classification rate at window size 17 × 17, whereas in the proposed optimized SVM the classification error rate decreases with increasing window size. Classification error rates are presented in Table 1. The proposed method outperforms the original SVM, especially at larger window sizes. In order to establish the superiority of the LG-SVM, its performance is also compared with recently published methods. In the literature, texture classification methods are evaluated in both the overlapped and non-overlapped cases; in the non-overlapped case there is neither intersection nor overlap between training and test samples. Our proposed method is evaluated in both cases. In the logical operators [14] and wavelet co-occurrence features [15] methods, overlapped samples are used. Results are listed in Table 2.
In the spectral histogram [17] and filter bank [16] methods, non-overlapped samples are used (Table 3). In each case the parameters (e.g., sample window size, number of test and training samples) are set accordingly. The segmentation results of the proposed method are shown and compared with the optimized Gabor filter method [9] in Fig. 4.
5 Conclusions
This paper described an SVM classification method based on a kernel constructed on Gabor features derived by LPC. The proposed kernel creates a
Fig. 4. Segmentation: (a) original image (b) LG-SVM (c) LG-SVM after smoothing (d) optimized Gabor filters [9]
feature space with a greater chance of separability at higher dimension. Excellent performance on different textures was achieved, and it was shown that the proposed method outperforms recently published methods. In this paper 1-D LPC and an all-pole model were used for feature extraction. Motivated by the success of this method, the use of 2-D LPC and a zero-pole (ARMA) model is being pursued by the authors.
References
1. J. Marti, J. Batlle, and A. Casals. Model-based object recognition in industrial environments. In ICRA, Proc. IEEE, 1997.
2. M. H. Horng, Y. N. Sun, and X. Z. Lin. Texture feature coding for classification of liver. Computerized Medical Imaging and Graphics, 26:33-42, 2002.
3. J. Kim and H. Park. Statistical texture features for detection of microcalcifications. IEEE Transactions on Medical Imaging, 18:231-238, 1999.
4. B. Scholkopf, K. Sung, C. J. C. Burges, F. Girosi, P. Niyogi, T. Poggio, and V. Vapnik. Comparing support vector machines with Gaussian kernels to radial basis function classifiers. IEEE Transactions on Signal Processing, 45:2758-2765, 1997.
5. K. I. Kim, K. Jung, S. H. Park, and H. J. Kim. Support vector machines for texture classification. IEEE Transactions on PAMI, 24:1542-1550, 2002.
6. L. Rabiner and B. H. Juang. Fundamentals of Speech Recognition. Prentice Hall, Englewood Cliffs, NJ, 1993.
7. V. Vapnik. The Nature of Statistical Learning Theory. Springer-Verlag, NY, 1995.
8. D. H. Hubel and T. N. Wiesel. Receptive fields and functional architecture in two nonstriate visual areas 18 and 19 of the cat. J. Neurophysiol., 28:229-289, 1965.
9. D. A. Clausi and M. E. Jernigan. Designing Gabor filters for optimal texture separability. Pattern Recognition, 33:1835-1849, 2000.
10. M. Davy and C. Doncarli. A new non-stationary test procedure for improved loudspeaker fault detection. J. Audio Eng. Soc., 50:458-469, 2002.
11. R. Duda, P. Hart, and D. Stork. Pattern Classification. Wiley, New York, 2000.
12. P. Brodatz. Textures: A Photographic Album for Artists and Designers. Dover, New York, 1966.
13. MIT Vision and Modeling Group. VisTex texture database, 1998.
14. V. Manian, R. Vasquez, and P. Katiyar. Texture classification using logical operators. IEEE Transactions on Image Processing, 9:1693-1703, 2000.
15. S. Arivazhagan and L. Ganesan. Texture classification using wavelet transform. Pattern Recognition Letters, 24:1513-1521, 2003.
16. T. Randen and J. H. Husoy. Filtering for texture classification: A comparative study. IEEE Trans. on Pattern Recognit. Machine Intell., 21:291-310, 1999.
17. X. Liu and D. Wang. Texture classification using spectral histogram. IEEE Trans. Image Processing, 12:661-670, 2003.
Grading Textured Surfaces with Automated Soft Clustering in a Supervised SOM J. Martín-Herrero, M. Ferreiro-Armán, and J.L. Alba-Castro Artificial Vision Team, Department of Signal Theory and Communications, University of Vigo, E36200 Vigo, Spain
[email protected], {mferreir, jalba}@gts.tsc.uvigo.es
Abstract. We present a method for automated grading of texture samples which grades the sample based on a sequential scan of overlapping blocks, whose texture is classified using a soft partitioned SOM, where the soft clusters have been automatically generated using a labelled training set. The method was devised as an alternative to manual selection of hard clusters in a SOM for machine vision inspection of tuna meat. We take advantage of the sequential scan of the sample to perform a sub-optimal search in the SOM for the classification of the blocks, which allows real time implementation.
1 Introduction Grading of surfaced textures is a common task in machine vision applications for quality inspection (QC). Samples of a textured surface are presented to the system, which has to find the degree of compliance of the surface with a standard, usually set by subjective knowledge about the expected appearance of the material in question. Therefore, texture grading generally requires the use of classifiers able to capture the knowledge about the material of the human QC operators, which usually can only be expressed through the labelling of samples or sub-samples. Industrial machine vision applications for control quality and assurance often have the additional constraint of speed: decisions about the sample being processed have to be taken in real time, usually on high speed production lines, which, on the other hand, is the major factor justifying the use of automated QC systems. Several recent works have shown the suitability of Self Organizing Feature Maps (SOM) [1] for machine vision applications [2-4]. A SOM maps a high dimensional space of feature vectors into a lower dimensional map while preserving the neighbourhood, such that feature vectors which are close together in the input space are also close to each other in the lower dimension output space, and feature vectors which are separated from each other in the input space are assigned separate nodes in the output map. Thus, a two dimensional SOM can be used to somewhat visualize the input space, allowing the QC operator to define connected neighbourhoods in the output map that correspond to arbitrarily shaped connected regions in the higher dimension input map. This is achieved by assigning representative samples to each node in the map, which is presented to the human operator as a map of textures. Provided that the underlying feature vector adequately describes the texture of the A. Campilho, M. Kamel (Eds.): ICIAR 2004, LNCS 3212, pp. 323–330, 2004. © Springer-Verlag Berlin Heidelberg 2004
samples for the problem at hand, the QC operator can then use his knowledge to define labelled neighbourhoods in the map that will be used for the classification of the samples in the production line (see, for instance, [4], [5]). However, this use of the SOM has two drawbacks. One is speed: every feature vector has to be compared with every node in the map to find the winner node (best matching unit, BMU), so as to be classified as belonging to the labelled neighbourhood where the BMU is located. This involves the computation of a Euclidean distance or a dot product per node. Several methods have been proposed to speed up the winner search [6-8], but their suitability depends on the dimension of the feature vector, the size of the map, the required accuracy in the selection of the winner, and the shape of the surface generated by the distances from the sample feature vector to the feature vectors of the map nodes. The second drawback is due to the interaction between the QC operator and the SOM visualized as an ordered set of samples of the training set. The training sample closest to each node in the SOM is chosen as its representative to build the visual map. We have detected several sources of problems during the field-test phase of a prototype using this technique: 1) Quite different samples may correspond to the same node, so that a unique sample will not be a good representative of every sample assigned to the node; 2) Some nodes may exist that are not the BMU for any sample in the training set, so that the sample closest to them actually belongs to another node and cannot be used as a representative for the former; such nodes have no representative in the map, where a black hole thus appears; 3) Even when a good map is obtained through careful design and training such that 1) and 2) are minimized or prevented, the operator may (and we have observed them to do it) pay attention to irrelevant features in the representatives which do not correspond to the actual factors affecting their classification, and may thus define classification neighbourhoods in the map which do not agree with the features for which it was designed; hence the SOM interface, which was devised for a better interaction with the human operator, may in fact hinder the tuning of the system by the operator. We have devised a method to define the neighbourhoods in the SOM automatically, as soft clusters, thus eliminating the problems due to the subjective interpretation of the visual map, while still profiting from the dimensional reduction and topological preservation inherent to the SOM to achieve better speed performance. Section 2 explains the method and Section 3 describes the results obtained with the implementation of the method in a machine vision system for quality inspection of tuna meat.
2 Method
The texture sample under inspection is divided into blocks with horizontal and vertical overlap, whose size is determined by the scale of the texture patterns of interest. From each block a feature vector is extracted which suitably describes the texture for the problem at hand. Feature distribution vectors (histograms) of local texture measures have been shown to be suitable feature vectors for this kind of application [9].
The feature vectors of the blocks define the input or feature space, F, which is mapped into a subspace, O, by a SOM which, if correctly trained, will preserve the intrinsic topology of F. For that purpose, first a sufficient number of representative blocks has to be extracted from samples, to be labelled by trained operators and make up the training set. Each block is labelled as belonging to one of a number of classes, C, ranging from "bad quality" to "good quality". The size of the training set has to be adequate to the size of the SOM, and the size of the SOM depends on the shape of the feature space and on the number of classes, such that enough separability between classes may be warranted in the resulting map. This is usually tuned via repeated trials for different configurations. After training, the SOM is constituted by an ordered two dimensional set of nodes, each one characterized by its feature vector, thus mapping F into O. Note that the labels of the training set do not play any role during the training of the SOM. In the classifying phase, to classify a block, we extract its feature vector, v, and locate the BMU, i.e. the node whose feature vector minimizes d(v, ·), where d denotes the Euclidean distance. Then the block is said to correspond to node ij, and, if labelled neighbourhoods have been defined in the map, it is assigned the same label as the neighbourhood that the node ij belongs to.
2.1 Automated Clustering of the SOM
To automatically define labelled neighbourhoods in the SOM, we classify every block in the training set and record, for every node and each class, the number of times that it has been the BMU for a block of that class. It is not a rare event that blocks of different classes are assigned to the same node in the map, and thus this count may be greater than zero for several classes, c, at the same node, ij. This is what drove us to use soft clusters to label the neighbourhoods in O. For every node ij in the map, a sensitivity function is defined for every class, c, for which the count is non-zero, as:
Then every node is assigned a vector whose c-th component is the membership of the node to class c, obtained from:
This allows a soft classification of blocks: once the BMU for a block has been found, the block contributes its memberships to each quality degree or class to the final grading of the sample. The final grading for the sample is the result of adding up the partial classifications of the blocks extracted from the sample.
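Since the sensitivity and membership equations were not recovered from the source, the sketch below assumes the simplest choice consistent with the text: the membership of a node in class c is its BMU count for that class normalized over all classes, and a sample is graded by accumulating the memberships of its blocks' BMUs:

```python
import numpy as np

def soft_memberships(bmu_indices, labels, map_shape, n_classes):
    """Per-node class memberships from BMU counts over a labelled training set.

    bmu_indices : sequence of (i, j) BMU coordinates, one per training block
    labels      : sequence of class indices, one per training block
    map_shape   : (rows, cols) of the SOM
    """
    counts = np.zeros(map_shape + (n_classes,))
    for (i, j), c in zip(bmu_indices, labels):
        counts[i, j, c] += 1
    totals = counts.sum(axis=2, keepdims=True)
    return np.divide(counts, totals, out=np.zeros_like(counts), where=totals > 0)

def grade_sample(block_bmus, memberships):
    """Accumulate the soft memberships of every block's BMU into a sample grade."""
    grade = np.zeros(memberships.shape[-1])
    for (i, j) in block_bmus:
        grade += memberships[i, j]
    return grade / max(len(block_bmus), 1)
```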
2.2 Fast Sub-optimal Winner Search
The most common way of finding the BMU in the map is exhaustive winner search. However, exhaustive winner search may take too long for real-time applications. To reduce the BMU search time, we propose a sub-optimal BMU search based on the topology preserving feature of the SOM: if we classify two adjacent overlapping image blocks, there is a high probability that their feature vectors are close to each other. Provided that the SOM preserves the topology of the feature space, F, they should also be close to each other in O. The overlap and the spatial correlation of the image cause the feature vectors of adjacent image blocks to be close in the feature space. SOM training is designed to obtain a topology preserving map between F and O, which means that adjacent reference vectors in F belong to adjacent nodes ij in O. When this condition holds we can perform a BMU search that takes advantage of the image scanning to search only in a neighbourhood of the last winning node. To ensure good results, it is necessary to quantify the topology preservation of the SOM. Probably the most referenced method for performing this measurement is the topographic product, P, [10], such that P < 0 if dim(O) < dim(F); P = 0 if dim(O) = dim(F); and P > 0 if dim(O) > dim(F). Therefore, a P value close to 0 will indicate that the SOM is able to map neighbouring feature vectors to neighbouring output nodes. Therefore, we take advantage of the sequential scanning of the blocks in the image to restrict the BMU search for each block to the neighbourhood of the BMU of the previous block. We perform the neighbourhood search describing a spiral around the previous BMU. The BMU for the current block is the nearest node to the block's feature vector within the given neighbourhood. The size of the neighbourhood, and thus the performance of the search, is given by the maximum search distance. Error curves graphing the distance between the real BMU (obtained through exhaustive search) and the approximate BMU for a set of blocks provide the grounds for choosing the maximum search distance. Care has to be taken to reset the search from time to time in order not to accumulate excessive error due to the use of approximate BMUs as initial points for the search. The optimal moment to reset the search is at the beginning of each scan line of blocks in the image. Consequently, the search for the first block in each scan line is exhaustive, thus warranting that the accumulated error is eliminated. In any case, the first block in a scan line and the previous block (the last block in the previous line) are not adjacent in the image, and therefore the assumption that the BMU of the former is in the neighbourhood of the BMU of the latter does not hold, so the exhaustive search is compulsory there. The relative extent of the neighbourhood search area with respect to O gives the gain in time obtained with the neighbourhood search with respect to the exhaustive search. There is no overhead due to the spiral search, because it can be performed with the same computational cost as the usual raster scan. This is achieved by using a displacement vector which is rotated 90° at prescribed steps determined from the step number n (with [·] denoting the integer part); the rotation amounts to exchanging and negating the components of the displacement vector. Thus, a single loop is enough to perform the entire search.
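A sketch of the sub-optimal search is given below (our illustration, not the authors' code): the BMU of each block is sought in a square spiral around the previous block's BMU, with the 90° rotation of the displacement vector implemented as (dx, dy) → (-dy, dx); the exact rotation schedule of the paper was not recovered, so a standard square spiral is assumed.

```python
import numpy as np

def spiral_offsets(max_nodes):
    """Yield (di, dj) offsets in a square spiral around (0, 0)."""
    di, dj = 0, 0
    dx, dy = 1, 0                    # current displacement vector
    step, leg = 1, 0
    yield (0, 0)
    count = 1
    while count < max_nodes:
        for _ in range(step):
            di, dj = di + dx, dj + dy
            yield (di, dj)
            count += 1
            if count >= max_nodes:
                return
        dx, dy = -dy, dx             # rotate the displacement vector by 90 degrees
        leg += 1
        if leg % 2 == 0:             # the spiral leg length grows every two turns
            step += 1

def local_bmu(weights, v, prev_bmu, max_nodes=49):
    """Nearest node to feature vector v among `max_nodes` spiral neighbours."""
    rows, cols, _ = weights.shape
    best, best_d = None, np.inf
    for di, dj in spiral_offsets(max_nodes):
        i, j = prev_bmu[0] + di, prev_bmu[1] + dj
        if 0 <= i < rows and 0 <= j < cols:
            d = np.sum((weights[i, j] - v) ** 2)
            if d < best_d:
                best, best_d = (i, j), d
    return best
```

With max_nodes = 49 the search covers the 7×7 neighbourhood mentioned in the experiments, and with 81 the 9×9 neighbourhood.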
3 Experimental Results
We implemented the method on an industrial machine vision system for quality inspection of tuna meat that used a SOM for interfacing with the QC operator [5], [11]. The samples for inspection are the contents of tuna cans. The cans are round, and the images are acquired such that the cans have a diameter of 512 pixels. The blocks fed to the SOM are 50×50, with a 60% horizontal and vertical overlap, totalling an average of 300 blocks per can. The feature vectors consist of a mixture of local binary patterns (LBP) [12], the entropy of the co-occurrence matrix [13] and a measure of correlation between different affine transformations of the block, such that the dimension of the input space is d = 7. We used a 12×12 SOM, trained with 1000 training blocks belonging to three different classes, C = 3, corresponding to "Good quality", "Medium quality", and "Bad quality". We used the usual training method [1]. The typical topographic product obtained was very small in magnitude, which means that the map achieved satisfactory neighbourhood preservation. This allowed the fast BMU search described in Section 2.2.
Fig. 1. A SOM of tuna meat. Note the black nodes which did not win for any block in the training set.
Figure 1 shows a typical map where each node is represented by the closest block in the training set. Figure 2 shows the membership to each of the three classes (Figures 2(a), 2(b) and 2(c)) for every node in O. The figures show that the darkest areas in each one occupy a different region of the map, and a certain degree of overlap exists between the different classes, which is accounted for via the soft classification scheme allowed by the use of soft membership.
Fig. 2. Membership of every node in the map to each of the three classes: a) good quality tuna, b) medium quality tuna, c) bad quality tuna. Darker areas indicate a higher membership.
Next, we studied the spatial correlation between blocks in the image and the corresponding BMUs in O. Figure 3 shows three curves for three kinds of samples (tuna cans): generally good looking, generally medium looking, and generally bad looking. As expected, bad looking tuna has a higher disorder and, thus, neighbouring blocks are less similar than in good looking cans. In any case, the shape of the three curves supports our statement that neighbouring blocks in the image have their corresponding BMUs in nearby locations of the SOM, and thus the fast search method can be applied at low cost. To evaluate this cost, we produced Figure 4, where we can see error rates for the different classes and the average error distance in the map. The bars show, for each class, the percentage of blocks (N = 6000 blocks) which were assigned the wrong BMU due to the sub-optimal search, i.e. a different BMU than that found by exhaustive search. The line (secondary axis) shows the average distance between the approximate BMU and the real BMU in O. If we take a search neighbourhood of 49 nodes around the previous BMU, the sub-optimal search requires just 34% of the time needed by an exhaustive search, and we get an average error rate of about 7%. We can reduce this average error rate to 3% if we increase the search neighbourhood to 81 nodes; however, the sub-optimal search then takes 56% of the time required by an exhaustive search. Field tests in which the cans thus graded were compared to the grading provided by QC operators showed that the level of performance of the system had been maintained in spite of the time gain achieved and the automated generation of the map of classes. The cost of the improvement falls on the training phase, which now requires a labelled training set that has to be generated in collaboration with the QC operators.
Fig. 3. Correlation between distances between blocks in the image sample and distances between the corresponding BMU in the map for different types of cans. 90% percentiles are also shown.
Fig. 4. Error rates for each of the three classes due to the sub-optimal search. The bars show the percentage of blocks for which the approximate BMU and the real BMU (exhaustive search) differed. The line shows the average distance in the map between the approximate BMU and the real BMU (N = 6000 blocks).
References
1. Kohonen, T.: Self-organizing Maps. Springer-Verlag, Berlin (1997)
2. Niskanen, M., Kauppinen, H., Silvén, O.: Real-time Aspects of SOM-based Visual Surface Inspection. Proceedings of SPIE, Vol. 4664 (2002) 123-134
3. Niskanen, M., Silvén, O., Kauppinen, H.: Experiments with SOM Based Inspection of Wood. International Conference on Quality Control by Artificial Vision (QCAV2001), 2 (2001) 311-316
4. Kauppinen, H., Silvén, O., Piirainen, T.: Self-organizing map based user interface for visual surface inspection. Scandinavian Conference on Image Analysis (SCIA99) (1999) 801-808
5. Martín-Herrero, J., Ferreiro-Armán, M., Alba-Castro, J. L.: A SOFM Improves a Real Time Quality Assurance Machine Vision System. Accepted for International Conference on Pattern Recognition (ICPR04) (2004)
6. Cheung, E.S.H., Constantinides, A.G.: Fast Nearest Neighbour Algorithms for Self-Organising Map and Vector Quantisation. Asilomar Conference on Signals, Systems and Computers, 2 (1993) 946-950
7. Kaski, S.: Fast Winner Search for SOM-Based Monitoring and Retrieval of High-Dimensional Data. Conference on Artificial Neural Networks, 2 (1999) 940-945
8. Kushilevitz, E., Ostrovsky, R., Rabani, Y.: Efficient Search for Approximate Nearest Neighbor in High Dimensional Spaces. ACM Symposium on Theory of Computing (1998) 614-623
9. Ojala, T., Pietikäinen, M., Harwood, D.: Performance evaluation of texture measures with classification based on Kullback discrimination of distributions. Proc. 12th International Conference on Pattern Recognition, Vol. I (1994) 582-585
10. Bauer, H.-U., Herrmann, M., Villmann, T.: Neural maps and topographic vector quantization. Neural Networks, Vol. 12(4-5) (1999) 659-676
11. Martín-Herrero, J., Alba-Castro, J.L.: High speed machine vision: The canned tuna case. In: Billingsley, J. (ed.) Mechatronics and Machine Vision in Practice: Future Trends. Research Studies Press, London (2003)
12. Mäenpää, T., Ojala, T., Pietikäinen, M., Soriano, M.: Robust texture classification by subsets of Local Binary Patterns. Proceedings of the International Conference on Pattern Recognition (2000)
13. Gonzalez, R.C., Woods, R.E.: Digital Image Processing. Addison-Wesley, New York (1992)
Textures and Wavelet-Domain Joint Statistics Zohreh Azimifar, Paul Fieguth, and Ed Jernigan Systems Design Engineering, University of Waterloo Waterloo, Ontario, Canada, N2L 3G1 {szazimif,pfieguth,jernigan}@uwaterloo.ca
Abstract. This paper presents an empirical study of the joint wavelet statistics for textures and other random imagery. There is a growing realization that modeling wavelet coefficients as independent, or at best correlated only across scales, assuming independence within a scale, may be a poor assumption. While recent developments in wavelet-domain Hidden Markov Models (notably HMT-3S) account for within-scale dependencies, we find empirically that wavelet coefficients exhibit within- and across-subband neighborhood activities which are orientation dependent. Surprisingly these structures are not considered by the state-of-the-art wavelet modeling techniques. In this paper we describe possible choices of the wavelet statistical interactions by examining the joint-histograms, correlation coefficients, and the significance of coefficient relationships.
1 Introduction
Statistical models, in particular prior probability models, for underlying textures are of central importance in many image processing applications. However, because of the high dimensionality (long range) of spatial interactions, modeling the statistics of textures is a challenging task. Statistical image modeling can be significantly improved by decomposing the spatial domain pixels into a different basis, most commonly a set of multiscale-multichannel frequency subbands, referred to as the wavelet domain [1]. Indeed, the wavelet transform (WT) has widely been used as an approximate whitener of statistical time series. It has, however, long been recognized [2] that the wavelet coefficients are neither Gaussian, in terms of the marginal statistics, nor white, in terms of the joint statistics. The parsimony of the wavelet representation means that the majority of the coefficients are small, and only a few of the coefficients are large in magnitude, implying that the marginal distributions of the high frequency wavelet subbands are more heavily tailed than a Gaussian, with a large peak at zero. Existing works assume a generalized Gaussian model, or some sort of mixture, for the marginal distribution [1]. Chipman et al. [1] and Crouse et al. [2] showed that this heavy-tailed non-Gaussian marginal can be well approximated by a Gaussian Mixture Model (GMM). Accordingly, non-linear wavelet shrinkage, such as Bayesian estimation, has been achieved with these non-Gaussian priors, which take into account the kurtotic behavior of the wavelet coefficients.
Fig. 1. Three hidden Markov models. The empty circles show the hidden states and the black nodes are the coefficient values. (a) Hidden states are independent. (b) Interscale dependencies are modeled. (c) The three subbands are integrated into one hybrid HMT.
As opposed to the marginal models, the question of joint models is much more complicated and admits more possibilities, with structures possible across subbands, orientations, and scales. Since the development of zerotree coding for image compression there have been many efforts to model these structures, including Markov random fields (MRFs) [3], [4], Besov spaces [5], wavelet hidden Markov models (HMMs) [2], [6], [7] and Gaussian scale mixtures (GSM) [8]. The wavelet-based HMMs, in particular, have been thoroughly studied and successfully outperform many wavelet-based techniques in Bayesian denoising, estimation, texture analysis, synthesis and segmentation. HMMs are indeed intended to characterize the wavelet joint statistics. As visualized by Fig. 1, the class of HMMs mainly includes the Independent Mixture Model (IMM) [2], the Hidden Markov Tree (HMT) [2], and HMT-3S [7]. In general, they adopt a probabilistic graph in which every wavelet coefficient (node) is associated with a set of discrete hidden states S = 0, 1, ..., M - 1 (in particular M = 2), displayed as empty circles in Fig. 1. To model the connectivity of those states, HMMs first define some hypotheses based on the properties of the wavelet coefficients, then parameterize models that fit those assumptions and can be solved by existing algorithms. In the two-state IMM, the simplest case of HMMs, hidden states are assumed to be independent and every wavelet coefficient is modeled as Gaussian, given its hidden state value (the variance). More sophisticated approaches sought to model the local wavelet statistics by introducing Markovian dependencies between the hidden state variables across scales and orientations. Crouse et al. [2] introduced the HMT, which captures wavelet interscale dependencies by considering Markov chains across scales, while assuming independence within and across the three high frequency channels. Fan and Xia [7] proposed HMT-3S in which, in addition to the joint interscale statistics captured by the HMT, the dependencies across subbands are exploited by integrating the three corresponding coefficients across the three orientations. Goal of this paper: Motivated by these inter-coefficient probabilistic studies, the primary goal of this paper is to study the wavelet joint statistics by empirically investigating local random field neighborhoods representing statistics of within- and across-scale coefficients. Although the previous observations
Fig. 2. Focus of this work: Development of joint Gaussian models of wavelet statistics.
highlighted some of the main wavelet coefficient correlations, there is still uncertainty in wavelet statistics: do these approaches offer reasonable choices of correlations? How should one examine the sufficiency of wavelet models? Would these models be justified by empirical statistics? This paper is meant to discuss these issues and to demonstrate the structure of coefficient correlation that was not captured by HMMs due to their primary assumptions. The development of wavelet Gaussian random field models for statistical textures forms the focus of our work (shown in Fig. 2). The goal, of course, is the development of non-Gaussian joint models with non-trivial neighborhoods. However, for the purpose of this paper, we are willing to limit ourselves to simplifying marginal assumptions (Gaussianity) which we know to be incorrect, but which allow us to undertake a correspondingly more sophisticated study of joint models. Example joint histograms, as representatives of the underlying coefficient densities, are visualized. We display the hierarchy of the wavelet covariance structure and define statistical neighborhoods for the coefficients. The main novelty is the systematic approach we have taken to study the wavelet neighborhood system, including 1) inter-scale dependency, 2) within-scale clustering, and 3) across-orientation (geometrical constraints) activities. This probabilistic modeling is directly applied to the wavelet coefficient values, but to some extent their significance is also considered. Surprisingly, our empirical observations indicate that the wavelet correlation structure for different textures does not always match those offered by the HMMs. We will discuss this in later sections.
2 Wavelet Neighborhood Modeling
In order to study exact correlations between the wavelet coefficients we considered a class of statistical textures based on Gaussian Markov random field (GMRF) covariance structures, as shown in Fig. 3. They are spatially stationary, an assumption made for convenience only, which is not fundamental to our analysis. The chosen spatial domain covariance structure P is projected into the wavelet domain as $P_W = W P W^{T}$ by computing the 2-D WT W, containing all translated and dilated versions of the selected wavelet basis functions,
where we have restricted our attention to the set of Daubechies basis functions.
Fig. 3. (a-e) Five GMRF textures used to visualize the wavelet correlation structure. (f) Correlation coefficients of a spatial thin-plate model in the wavelet domain, in which the main diagonal blocks correspond to the same scale and orientation, whereas off-diagonal blocks illustrate cross-correlations across orientations or across scales.
The wavelet covariance (Fig. 3(f)) is not a diagonal matrix, indicating that the wavelet coefficients are not independent. Intuitively, localized image structures such as edges tend to have substantial power across many scales. More interestingly, the covariance is block-structured, and it is evident that the coefficient interactions align with the direction of their subband. We have observed [9] that, although the majority of correlations are very close to zero (i.e., decorrelated), a relatively significant percentage (10%) of the coefficients are strongly correlated across several scales or within a particular scale but across the three orientation subbands. Clearly a random field model for wavelet coefficients will need to be explicitly hierarchical. One approach to statistically model these relationships was to implement a multiscale model [9]. Although the multiscale model captured the existing strong parent-child correlation, spatial and inter-orientation interactions were not explicitly taken into consideration. Our most recent work [10] investigated two techniques to approximate the non-Markov structure of the wavelet covariance by a Markovian neighborhood which contains the significant inter-orientation and spatial relationships, which we seek to visualize more formally and compare with other methods in this paper.
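As an illustration of how such inter-coefficient correlations can be estimated empirically (our sketch, assuming PyWavelets and NumPy are available; this is not the procedure used to produce Fig. 3(f)), one can transform many texture realizations, stack their coefficient vectors and compute the correlation matrix:

```python
import numpy as np
import pywt

def coefficient_matrix(samples, wavelet='db4', level=2):
    """Stack the 2-D wavelet coefficients of each sample into one row."""
    rows = []
    for img in samples:
        coeffs = pywt.wavedec2(img, wavelet, level=level)
        arr, _ = pywt.coeffs_to_array(coeffs)
        rows.append(arr.ravel())
    return np.array(rows)

# Synthetic realizations as stand-ins for the GMRF texture samples.
rng = np.random.default_rng(0)
samples = [rng.standard_normal((32, 32)) for _ in range(500)]
C = coefficient_matrix(samples)
corr = np.corrcoef(C, rowvar=False)     # inter-coefficient correlation matrix
print(corr.shape)
```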
Fig. 4. Empirical joint histograms of a coefficient at position (x,y) in a horizontal subband associated with different pairs of coefficients at the same scale and orientation (a,b), at the same orientation but adjacent scales (c,d), at the same scale but across orientations (e-h). The skewness in the ellipsoid indicates correlation.
2.1 Wavelet Domain Joint Histograms
In order to characterize the wavelet neighborhood explicitly, we first utilize joint histogram plots. This intermediate step helps to identify the dependency of two coefficients even if they appear decorrelated on their correlation map (i.e., decorrelation does not always mean independence!). For a typical texture, joint histograms of a horizontally aligned coefficient at position (x, y) associated with different pairs of coefficients are illustrated in Fig. 4. These plots highlight the following important aspects of the coefficient connectivity. Remark 1: In the top row, the first two plots show extended contours indicating that two spatially adjacent horizontal coefficients are not only dependent, but that the direction of their correlation also matches that of their subband. For instance, within its subband, a horizontal coefficient is more correlated with its adjacent left and right neighbors than with its up and down neighbors. Remark 2: The top row's last two plots are joint histograms of parent-child horizontal coefficients. It is quite evident that a child strongly depends not only on its parent (a fact observed by many other researchers) but also on its parent's adjacent neighbor (left or right). We also observed that, by symmetry, a vertical coefficient statistically depends on its parent and on the parent's upper or lower neighbor. Remark 3: The bottom row plots display joint histograms of a horizontal coefficient with its corresponding neighbors within the same scale but across the other two orientations. Firstly, the nearly circular contours indicate that coefficients at the same location but from different orientations are almost independent!
Secondly, there is still some inter-orientation correlation which aligns with the direction of the centered coefficient (i.e., the correlation structure is subband dependent). In summary, we emphasize that the purpose of this paper is not merely to report the striking wavelet correlations exhibited in these empirical observations. Rather, it is observed that, surprisingly, the existing wavelet joint models not only consider just a subset of these inter-relationships but also connect some coefficients which are in fact independent; e.g., in HMT-3S the three coefficients at the same location from the three subbands are grouped into one node (assumed to be correlated), an assumption that is rejected by these histogram plots.
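Joint histograms such as those in Fig. 4 can be produced along the lines of the following sketch (our illustration, assuming PyWavelets), shown here for a horizontal coefficient paired either with its right-hand neighbour in the same subband or with the vertical coefficient at the same position:

```python
import numpy as np
import pywt

def joint_histogram(images, pair='right', wavelet='db4', bins=64):
    """2-D histogram of (horizontal coefficient, chosen neighbour) pairs."""
    xs, ys = [], []
    for img in images:
        _, (cH, cV, cD) = pywt.dwt2(img, wavelet)      # finest-scale subbands
        if pair == 'right':                            # within-subband neighbour
            xs.append(cH[:, :-1].ravel()); ys.append(cH[:, 1:].ravel())
        elif pair == 'vertical_same_pos':              # across-orientation pair
            xs.append(cH.ravel()); ys.append(cV.ravel())
    x, y = np.concatenate(xs), np.concatenate(ys)
    H, xedges, yedges = np.histogram2d(x, y, bins=bins)
    return H

# hist = joint_histogram(texture_patches, pair='right')   # texture_patches is hypothetical
```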
2.2 Wavelet Domain Correlation Structure
Motivated by the histogram plots, we have chosen to study the problem visually, without any particular assumption regarding the coefficient position on the wavelet tree. First, correlation coefficients are calculated from the wavelet prior for the three fine scale textures displayed in Fig. 3. As shown in Fig. 5, we use the traditional 2-D wavelet plot to display the correlation of a coefficient coupled with any other coefficient on the entire wavelet tree. Each panel includes three plots illustrating the local neighborhood for a centered coefficient (marked in the plots) chosen from the horizontal, vertical, and diagonal subbands. The left column panels in Fig. 5(a-c) show correlation coefficients for a coefficient paired with all other nodes on the wavelet tree. There is a clear consistency between the joint histograms and these correlation maps, which shows: 1) The concentration of the wavelet correlations in a locality. 2) This locality increases toward finer scales, which supports the persistency property of wavelet coefficients [6]. 3) The local neighborhood definition for any given pixel is not limited to the pixel's subband: it extends to dependencies across directions and resolutions. Besides the long range across-scale correlations, every typical coefficient exhibits strong correlation with its spatially near neighbors, both within its subband and across orientations. 4) The correlation structure for horizontally and vertically aligned coefficients is almost symmetrically identical. For textures whose edges extend more or less toward one direction (such as tree bark), this similarity does not hold. To consider the sparse representation property of the WT, these empirical evaluations have been extended to the dependency structure of the significant coefficients. In [9], we defined the significance map as a tool to identify those correlations corresponding to the significant coefficients. Fig. 5(d-f) show the significance of correlations for the corresponding panels displayed in Fig. 5(a-c). It is evident from these diagrams that the within-scale dependency range reduces to a shorter locality (yet remains orientation dependent), but across-scale activities are still present over several scales. Interestingly, the wavelet correlation plots in Fig. 5 show a clear consistency in structure for many textures. They confirm that: 1) The well-structured coefficient dependencies are hierarchical and orientation dependent. 2) Coefficients across the three orientations and at the same spatial position are decorrelated; however, there is a clear dependency between coefficients across orientations and
Fig. 5. Wavelet correlation structure for the three fine scale textures displayed in Fig. 3. Each panel contains three plots illustrating the local neighborhood for a centered coefficient (marked in the plot) from the horizontal, vertical, and diagonal subbands. The left column panels (a-c) show correlation coefficients for a coefficient paired with all other nodes on the wavelet tree. The right column panels (d-f) plot the significance of the above inter-relationships.
at nearby spatial positions. 3) Coefficients are correlated with their parent and its neighbors, in addition to the parents in the other two coarser subbands.
3 Conclusions
A thorough study of the 2-D wavelet statistics has been presented in this paper. Empirical examination of the coefficient correlations, within and across scales, revealed that there exist local and sparse random field models governing these local dependencies. A superset including all statistically local neighbors for a wavelet coefficient was demonstrated. We compared our modeling observations with the advanced wavelet joint models. This study showed that the correlation structures presumed and proposed by those approaches (such as HMT-3S) do not always accurately integrate the correlated coefficients. We also discussed examples of interscale and intra-scale dependencies that are missing in the existing models. We are expanding this ongoing research to the statistics of real world images. The early empirical examinations show consistency with the correlation structures studied in this article.
References
1. H. Chipman, E. Kolaczyk and R. McCulloch, "Adaptive Bayesian wavelet shrinkage", J. Amer. Statis. Assoc., pp. 92-99, 1997
2. M. S. Crouse, R. D. Nowak, R. G. Baraniuk, "Wavelet-based statistical signal processing using hidden Markov models", IEEE Trans. on SP, vol. 46, pp. 886-902, 1998
3. M. Malfait and D. Roose, "Wavelet-based image denoising using a Markov random field a priori model", IEEE Trans. on IP, vol. 6, pp. 549-565, 1997
4. A. Pizurica, W. Philips, I. Lemahieu, and M. Acheroy, "A joint inter- and intrascale statistical model for Bayesian wavelet based image denoising", IEEE Trans. on IP, vol. 11, pp. 545-557, 2002
5. A. Srivastava, "Stochastic models for capturing image variability", IEEE Signal Processing Magazine, vol. 19, pp. 63-76, 2002
6. J. Romberg, H. Choi, and R. Baraniuk, "Bayesian tree-structured image modeling using wavelet-domain hidden Markov models", IEEE Trans. on IP, vol. 10, pp. 1056-1068, 2001
7. G. Fan and X. Xia, "Wavelet-based texture analysis and synthesis using hidden Markov models", IEEE Trans. on Cir. and Sys., vol. 50, pp. 106-120, 2003
8. J. Portilla, V. Strela, M. J. Wainwright and E. P. Simoncelli, "Image denoising using Gaussian scale mixtures in the wavelet domain", IEEE Trans. on IP, vol. 12, pp. 1338-1351, 2003
9. Z. Azimifar, P. Fieguth, and E. Jernigan, "Towards random field modeling of wavelet statistics", Proceedings of the 9th ICIP, 2002
10. Z. Azimifar, P. Fieguth, and E. Jernigan, "Hierarchical Markov models for wavelet-domain statistics", Proceedings of the 12th IEEE SSP, 2003
Video Segmentation Through Multiscale Texture Analysis Miguel Alemán-Flores and Luis Álvarez-León Departamento de Informática y Sistemas Universidad de Las Palmas de Gran Canaria Campus de Tafira, 35017, Spain {maleman,lalvarez}@dis.ulpgc.es
Abstract. Segmenting a video sequence into different coherent scenes requires analyzing those aspects which allow finding the changes where a transition is to be found. Textures are an important feature when we try to identify or classify elements in a scene and, therefore, can be very helpful to find those frames where there is a transition. Furthermore, analyzing the textures in a given environment at different scales provides more information than considering the features which can be extracted from a single one. A standard multiscale texture analysis would require an adjustment of the scales in the comparison of the textures. However, when analyzing video sequences, this process can be simplified by assuming that the frames have been acquired at the same resolution. In this paper, we present a multiscale approach for segmenting video scenes by comparing the textures which are present in their frames.
1 Introduction
In this paper, we present a method for video segmentation based on the distribution of the orientation of the edges. We use the results of the multiscale texture analysis described in [1] and study the behavior of natural textures in order to find the transitions between the different video scenes. To this end, we estimate the gradient at every point of the region and build an orientation histogram to describe it. This allows performing satisfactory classifications in most cases, but some textures are still not properly classified. A multiscale analysis of the textures improves the results by considering the evolution of the textures along the scale. In natural textures, the changes produced when a certain scene is observed at different distances introduce new elements which must be taken into account when comparing the views. This texture comparison technique is applied to video segmentation by considering as normally evolving video sequences those intervals within which the comparison energy is low enough. The paper is structured as follows: Section 2 shows how textures can be described and classified through their orientation histograms. In Section 3, multiscale analysis is introduced to improve the classification method and some considerations regarding natural textures are analyzed. Section 4 describes the application of multiscale texture comparison to video segmentation. Finally, in Section 5, we give an account of our main conclusions.
2 Texture Description and Classification
In order to describe a texture in terms of the edges which are present in it, we must estimate the magnitude and the orientation of the gradient at every point of the region. With these values, we can build an orientation histogram which reflects the relative importance of every orientation. We first calculate an initial estimation at every point using the following mask for the horizontal component and its transpose for the vertical component:
Using the structure tensor method, the orientation of the gradient at a certain point can be estimated by means of the eigenvector associated with the lowest eigenvalue of the matrix in (2), whereas the magnitude can be approximated by the square root of its highest eigenvalue. We first convolve the image with a Gaussian to increase the robustness of the approximations. By summing the magnitudes of the points with the same orientation, we can build an orientation histogram for each texture. These histograms are normalized, so that the global weight is the same for all of them.
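A minimal sketch of this step is given below (our illustration; the exact derivative mask of the paper is not reproduced, so a Sobel mask is used, and the bin count is arbitrary):

```python
import numpy as np
from scipy.ndimage import gaussian_filter, sobel

def orientation_histogram(image, n_bins=64, sigma=1.5):
    """Normalized histogram of gradient orientations weighted by magnitude,
    using a smoothed structure tensor."""
    img = gaussian_filter(image.astype(float), sigma)
    gx, gy = sobel(img, axis=1), sobel(img, axis=0)
    # Structure tensor components, locally averaged
    jxx = gaussian_filter(gx * gx, sigma)
    jyy = gaussian_filter(gy * gy, sigma)
    jxy = gaussian_filter(gx * gy, sigma)
    # Dominant orientation, and the square root of the largest eigenvalue
    theta = 0.5 * np.arctan2(2 * jxy, jxx - jyy)
    lam_max = 0.5 * (jxx + jyy + np.sqrt((jxx - jyy) ** 2 + 4 * jxy ** 2))
    mag = np.sqrt(np.maximum(lam_max, 0))
    hist, _ = np.histogram(theta, bins=n_bins, range=(-np.pi / 2, np.pi / 2),
                           weights=mag)
    return hist / max(hist.sum(), 1e-12)      # unit total weight
```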
In order to compare two textures, an energy function is built in which the Fourier coefficients of both histograms are analyzed. A change in the orientation of a texture only causes a cyclical shift in its histogram. For this reason, the Fourier coefficients are used: if two orientation histograms of length L correspond to the same texture but shifted by a positions, i.e. the texture has been rotated by the corresponding angle, then their Fourier coefficients differ only by a phase factor and have identical magnitudes. In addition, the fact that the number of discrete orientations used for the histograms is constant, as well as the normalization of the weights, makes the lengths of the signals and the total weight equal for both textures. Since the higher frequencies are more sensitive to noise than the lower ones, a monotonically decreasing weighting function can be used to emphasize the discrimination, thus obtaining an expression in which the first terms have a more important contribution than the last ones.
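A rotation-invariant comparison along these lines can be sketched as follows (our illustration; the exact weighting function and energy expression of the paper were not recovered, so a simple decreasing weight is assumed):

```python
import numpy as np

def histogram_energy(h1, h2, n_terms=16):
    """Weighted distance between the Fourier magnitude spectra of two
    normalized orientation histograms (cyclic shifts leave it unchanged)."""
    H1 = np.abs(np.fft.rfft(h1))[:n_terms]
    H2 = np.abs(np.fft.rfft(h2))[:n_terms]
    k = np.arange(len(H1))
    weights = 1.0 / (k + 1.0)               # assumed decreasing weighting
    return float(np.sum(weights * (H1 - H2) ** 2))

# Invariance check: a circular shift of the histogram does not change the energy.
h = np.random.rand(64); h /= h.sum()
print(histogram_energy(h, np.roll(h, 9)))    # ~0
```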
Fig. 1. Results of searching for similar textures for a texture in database 1 and a texture in database 2
To test this technique, we have used two sets of textures contained in two databases. The first database has been made publicly available for research purposes by Columbia and Utrecht Universities [2] and consists of different materials. The second one corresponds to different natural scenes acquired at several distances. In Fig. 1, we show some results of the application of the technique explained above: an image is selected from one of the databases and the five images which produce the lowest energies are shown. The orientation histograms extracted from the textures describe how the different orientations are quantitatively distributed across the region which is studied, but they provide no information about the spatial neighborhood of the pixels with a certain orientation. Thus, a completely noisy image, in which all orientations are found in approximately the same proportion but in a disordered way, would generate a histogram similar to that of a circle, where the orientation increases gradually along its outline. This forces us to search for a technique which complements the information provided by this kind of histogram in order to enhance its recognition capability.
3 Multiscale Texture Analysis
The interpretation of the information we perceive from the environment depends on the scale we use to process it. The multiscale analysis approach has been successfully used in the literature for texture enhancement and segmentation (see [3] and [4] for more details). A multiscale analysis can be determined by a set of transformations parameterized by the scale. Let I be an image, i.e. a function defined on the domain where the image is given. We will consider that I and its transformed versions have finite norm, and that the transformation of I at a given scale is a new image which corresponds to I at that scale. For a given image I, to which the multiscale analysis is applied, we can extract a histogram which determines the distribution of the orientations of I at each scale. In this case, the normalization of the values within a
histogram is performed with respect to the initial sum. In order to compare the histograms of two images, the scale must first be adjusted.
3.1 Gaussian Multiscale Analysis
We will use a Gaussian filter, whose properties are described in [5] and [6]. In one dimension, this process can be described as follows, where the scale is related to the standard deviation of the Gaussian filter.
Given a signal, the result of convolving it with the Gaussian filter is equivalent to the solution of the heat equation using the signal as the initial data. Considering this relationship, a discrete version of the heat equation can be used to accelerate the approximation of the Gaussian filtering (see [7] for more details), which results in a recursive scheme in three steps for each direction. This process is performed by rows and by columns in order to obtain a discrete expression for a two-dimensional Gaussian filtering. Making use of the properties of the Gaussian kernels, the result of applying a Gaussian filter with an initial scale can be used to obtain a Gaussian filtering of the initial image at a different scale with no need to start again from the input.
3.2 Multiscale Orientation Histogram Comparison
We must take into account that, for a certain texture, the use of different resolutions forces us to apply Gaussian functions with different standard deviations, thus requiring an adaptation stage. To do that, we extract the evolution of the magnitude of the gradients at different scales and use it to compare the textures. Even if the quantitative distribution of the orientations may be alike for different textures, the spatial distribution will cause a divergence in the evolution, so that the adjustment factors will differ. One of the properties of Gaussian filtering is the relationship between the resolution of two images and the effect of this kind of filter: the result of applying a Gaussian filter with a given standard deviation to an image acquired at a given resolution is equivalent to applying a Gaussian filter with a standard deviation k times larger to the same image acquired at a resolution k times higher. Given two textures, we will estimate the scale factor between them using the normalized evolution of the sum of the norm of the gradient over scales.
Fig. 2. Comparison of two similar textures at different scales
It is well known (see for instance [5]) that this normalized evolution is a decreasing function of the scale. On the other hand, for a texture that is periodically repeated, the same behaviour is obtained at any resolution. Consequently, in order to estimate a scale factor between two textures we compare their normalized evolution functions. Let the ratios obtained for the two textures at each of N scales form two series; the best adjusting coefficient to fit one series to the other can be obtained as follows. We first fix a ratio value and interpolate within both series to obtain two new series which estimate, for each texture, the scale at which that ratio is reached. We must point out that, for ratio values lower than 1, these scales are well defined, because the normalized evolution is a decreasing function of the scale. With these values, we minimize the following error to obtain the scale factor
We can study how the energy obtained when comparing the orientation histograms evolves as we apply Gaussian filtering to the textures. We use the adjusting factor to relate the scales to be compared and we obtain the energies for the comparison of the histograms at N different scales. Figure 2 shows the results of comparing two images corresponding to similar textures, acquired at different distances. As observed, not only is the initial energy low, but the subsequent energies, obtained when comparing the images at the corresponding scales, also decrease as the scale increases. On the other hand, Fig. 3 shows the comparison of two images of different textures. The energies, far from decreasing, increase from the initial value.
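To make the comparison concrete, the sketch below computes orientation histograms at corresponding scales and an L2-type energy between them; the magnitude weighting, the number of bins and the energy definition are our assumptions, since the paper defines its own energy from the structure tensor earlier in the text.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, sobel

def orientation_histogram(image, sigma, bins=36):
    """Gradient-orientation histogram at scale sigma, magnitude-weighted
    and normalized with respect to its total sum."""
    smoothed = gaussian_filter(image.astype(np.float64), sigma)
    gx = sobel(smoothed, axis=1)
    gy = sobel(smoothed, axis=0)
    hist, _ = np.histogram(np.arctan2(gy, gx), bins=bins,
                           range=(-np.pi, np.pi), weights=np.hypot(gx, gy))
    return hist / (hist.sum() + 1e-12)

def multiscale_energies(img_a, img_b, sigmas, scale_factor):
    """Energies of the histogram comparison at N corresponding scales,
    where `scale_factor` relates the scales of the two textures."""
    return [np.sum((orientation_histogram(img_a, s)
                    - orientation_histogram(img_b, s * scale_factor)) ** 2)
            for s in sigmas]
```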
3.3 Resolution Adjustment in Natural Scenes
We have extracted the evolution of the square of the gradient across the image for all the textures in the second database, in which different natural scenes
have been acquired at different distances. With these values, we have calculated a ratio for every pair of pictures in the database. Instead of observing a great variability in the ratios depending on the nature of the scene and the acquisition distance, they are very close to 1 in most cases. The fact that certain particular elements appear when we approach them, while other global elements disappear, thus generating new gradients while others are eliminated, keeps the total sum similar, and the information, in terms of changes existing in the image, is approximately constant. In fact, the mean ratio for the comparison of two textures, considering in each case the ratio which is lower than 1, is 0.91975, with standard deviation 0.06190. In artificial textures, a change in the resolution produces a change in the evolution of the sum of the squared gradients and no additional information is added, thus generating more variable ratios.
Fig. 3. Comparison of two different textures
4 Video Segmentation
The multiscale comparison of natural textures described above has been used to segment video sequences by finding the transitions in which the texture histogram undergoes a great change. On the assumption that, when a scene finishes and a new one starts, the textures in the frames are quite different, the energies obtained when comparing them will be significant and the transition can be located. If we force the system to be sensitive enough to avoid overlooking any scene transition, the threshold which determines from which value a change is considered significant may be too low to avoid including some intra-scene changes as transitions, thus reducing the specificity. At the same time, transitions can be either abrupt, i.e. a scene finishes in one frame and the new scene starts in the next, or soft, i.e. there is a gradual dissolve, a shift, or some other effect in the passage from one scene to the following. The latter type forces us to compare frames which are not consecutive in order to detect the change, but this might include more intra-scene changes as transitions. Thus, a multiple temporal interval is needed. We have used a set of videos and reports provided by researchers from the Universidad Autónoma de Madrid [8]. Human observers have signaled the frames where a transition is found, and we have compared these values with the frames where the energy is higher than a certain threshold. We have used four versions of every frame: the original image and the image after the application of a Gaussian filter at three increasing scales, the largest with standard deviation 10. The best results have been obtained using the mean of the two intermediate values.
Fig. 4. Example of scenes and transitions detected in a video sequence. Each pair of images corresponds to the initial and final frames of a scene
If we use an interval of 10 frames in the texture comparison in order to determine where a transition occurs, we are able to detect all actual transitions in the sequence of video frames. However, 18% of normal changes, i.e. those which occur between frames of the same scene, are labelled as transitions, since there is a considerable evolution of the elements in them. If we consider a combination of the energies at the different scales, these false transitions are reduced to 12%. Furthermore, if we select the candidate transitions for a temporal interval of 10 frames and then analyze them with a temporal interval of 5 frames, we can reject some of them, considering the changes as normal intra-scene evolutions, and the false transitions are reduced to 10%. Table 1 shows a comparison of the results using these methods. Figure 4 shows the initial and final frames of different scenes extracted from a video sequence, and a sketch of this two-pass detection is given below.
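The two-pass logic can be sketched as follows; the helper `texture_energy` stands for the multiscale histogram comparison described above, and the threshold is an illustrative value, not the authors' setting.

```python
# Sketch of transition detection with two temporal intervals.
def detect_transitions(frames, texture_energy, coarse_gap=10, fine_gap=5,
                       threshold=0.5):
    candidates = []
    # First pass: compare frames `coarse_gap` apart so that soft (gradual)
    # transitions are not missed.
    for i in range(len(frames) - coarse_gap):
        if texture_energy(frames[i], frames[i + coarse_gap]) > threshold:
            candidates.append(i)
    # Second pass: re-check each candidate with a shorter interval and reject
    # changes that turn out to be normal intra-scene evolution.
    return [i for i in candidates
            if texture_energy(frames[i], frames[i + fine_gap]) > threshold]
```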
5 Conclusion
In this paper, we have presented a new approach to video sequence segmentation based on a multiscale classification of natural textures. By using the structure tensor, we have obtained an estimation of the gradient at every point of the textures. The extraction of orientation histograms to describe the distribution of the orientations across a textured region, together with the multiscale analysis of the histograms, has produced quite satisfactory results, since the visual similarity or difference between two textures is much more reliably detected by the evolution of the energies resulting from comparing the histograms at different scales. We have observed that the ratio for the adjustment of the scales is not far from 1 when natural images are considered, since the information contained in them changes qualitatively, but not as much quantitatively. The need for high sensitivity, in terms of transitions detected in order to avoid overlooking them, produces a decrease in specificity, in such a way that certain normal changes appear as transitions when the energy is extracted. However, the comparison at different scales and using different temporal intervals significantly reduces these misclassified normal changes while preserving the right ones. The promising results obtained in the tests which have been implemented confirm the usefulness of the multiple comparison of the images, since they endow us with a much more robust discrimination criterion.
References
1. Alemán-Flores, M., Álvarez-León, L.: Texture Classification through Multiscale Orientation Histogram Analysis. Lecture Notes in Computer Science, Springer-Verlag 2695 (2003) 479-493
2. Columbia University and Utrecht University: Columbia-Utrecht Reflectance and Texture Database. http://www.cs.columbia.edu/CAVE/curet/.index.html
3. Paragios, N., Deriche, R.: Geodesic Active Regions and Level Set Methods for Supervised Texture Segmentation. International Journal of Computer Vision 46:3 (2002) 223
4. Weickert, J.: Multiscale texture enhancement. In: V. Hlavac, R. Sara (Eds.), Computer Analysis of Images and Patterns, Lecture Notes in Computer Science, Springer, Berlin 970 (1995) 230-237
5. Evans, L.: Partial Differential Equations. American Mathematical Society (1998)
6. Lindeberg, T.: Scale Space Theory in Computer Vision. Kluwer Academic Publishers (1994)
7. Álvarez, L., Mazorra, L.: Signal and Image Restoration Using Shock Filters and Anisotropic Diffusion. SIAM J. on Numerical Analysis 31:2 (1994) 590-605
8. Bescós, J.: Shot Transitions Ground Truth for the MPEG7 Content Set. Technical Report 2003/06, Universidad Autónoma de Madrid (2003)
Estimation of Common Groundplane Based on Co-motion Statistics
Zoltan Szlavik1, Laszlo Havasi2, and Tamas Sziranyi1
1 Analogical and Neural Computing Laboratory, Computer and Automation Research Institute of Hungarian Academy of Sciences, P.O. Box 63, H-1518 Budapest, Hungary
{szlavik, sziranyi}@sztaki.hu
2 Peter Pazmany Catholic University, Piarista köz 1., H-1052 Budapest, Hungary
{havasi}@digitus.itk.ppke.hu
Abstract. The paper presents a method for groundplane estimation from image pairs, even in unstructured environments with unstructured motion. In a typical outdoor multi-camera system the observed objects might appear very different due to differences in lighting conditions and camera positions. Static features such as color, shape, and contours cannot be used for image matching in these cases. In this paper a method is proposed for matching partially overlapping images captured by video cameras. The matching is done using co-motion statistics, followed by outlier detection and a nonlinear optimization. The described robust algorithm finds point correspondences in two images without searching for any structures and without tracking any continuous motion. Real-life outdoor experiments demonstrate the feasibility of this approach.
1 Introduction
Multi-camera based observation of human or traffic activities is becoming of increasing interest for many applications, such as semi-mobile traffic control using automatic calibration or tracking humans in a surveillance system. In a typical outdoor scenario, multiple objects, such as people and cars, move independently on a common ground plane. Transforming the activity captured by distributed individual video cameras from local image coordinates to a common frame then sets the stage for global analysis and tracking of the activity in the scene. Matching different images of a single scene can be difficult because of occlusion, aspect changes and lighting changes that occur between different views. Over the years numerous algorithms for image and video matching have been proposed. Still-image matching algorithms can be classified into two categories. In “template matching” the algorithms attempt to correlate the gray levels of image patches, assuming that they are similar [3][7]. This approach appears to be valid for image pairs with small differences; however, it may be wrong at occlusion boundaries and within featureless regions. In “feature matching” the algorithms first extract salient primitives from images (edges or contours) and match them in two or more views [1][4][5][6]. An
image can then be described by a graph with primitives as nodes and geometric relations defining the links. The registration then becomes a mapping between the two graphs: subgraph isomorphism. These methods may fail if the chosen primitives cannot be reliably detected. The views of the scene from the various cameras might be very different, so we cannot base the decision solely on the color or shape of objects in the scene. In a multi-camera observation system the video sequences recorded by the cameras can be used for estimating matching correspondences between different views. Video sequences contain much more information than the scene structure of any individual frame, since they also capture information about the scene dynamics. The scene dynamics are an inherent property of the scene; they are common to all video sequences recorded from the same scene, even when taken by different cameras from different positions at different zooms. In [9][10] approaches were presented which align the tracks of the observed objects. In these cases the capability of robust object tracking is assumed, and this is the weak point of the method. It must be assumed that the speed does not change by more than a predefined value and that the objects in the scene are moving continuously. In our experiments we use standard PAL digital cameras with wide-angle lenses. The field of view is therefore large; consequently, the size of features is small, and the images are blurred and noisy. The common field of view of two neighboring cameras is less than 30%. We have tested several correlation-based toolboxes for matching, but they gave poor results. In the case of several randomly moving objects on the screen, the conventional 3D registration of cameras usually needs some a priori object definition or human interaction to help the registration. The approach we propose in this paper is a substantial extension of previously published sequence-based image matching methods [9][10] towards non-structured estimation. It aims to use statistics of concurrent motions – the so-called co-motion statistics – instead of trajectories of moving objects to find matching points in image pairs. The input of the system is video sequences from fixed cameras at unknown positions, orientations and zooms. After matching the images, the system aligns the multiple camera views into a single frame, making it possible to track all moving objects across different views. In our approach no a priori information is needed, and the method also works well on images of randomly scrambled motion, where other methods fail because of the missing fixed structures.
2 Common Groundplane Estimation
The main steps of our algorithm are:
1. Motion detection; record point coordinates where motion is detected.
2. Update local and remote statistical maps (the notion of statistical maps is defined in Section 2.2).
3. Extract candidate point pairs from the statistical maps.
4. Outlier rejection.
5. Fine-tuning of point correspondences by minimizing the reprojection error between the sets of candidate point pairs.
6. Alignment of the two views.
The major assumption is time synchronization between the cameras. When it holds, the motion information can be transformed into motion statistics. Later we will show that, with further processing, this assumption can be avoided.
2.1 Motion Detection
Our application field places several special requirements on motion extraction. The videos of open-air traffic were created with normal digital cameras with wide-angle lenses, so the images are blurred and noisy. The size of the moving objects varies greatly; there are small blobs (walking people) and huge blobs (trams or buses). The background cannot be extracted perfectly, because we do not want to assume any a priori knowledge about the scene. In the first step we define the pixels which are considered in the statistical calculus. The motion blobs are extracted by simple running-average background subtraction with a large threshold to delete the irrelevant parts, using the reference image.
This method is fast and very sensitive with a low threshold value. A disadvantage is that it often detects noise and background flashes. In the preprocessing algorithm the detected motion blobs are dilated until they reach the local maxima of the edge map; we find the local maxima of the edge map using an algorithm similar to that proposed by Canny [11]. This approach is a usable solution for detecting the significant moving objects in the scene. In our method we do not need precise motion detection and object extraction, because these minor errors are irrelevant to the later statistical processing. The binarized image with the detected objects is the motion map, which is used for updating the statistical maps.
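A minimal sketch of this step, under simplifying assumptions: a running-average reference image and a fixed threshold; the dilation towards edge maxima is omitted, and the learning rate and threshold values are illustrative rather than the authors' settings.

```python
import numpy as np

def update_motion_map(frame, background, alpha=0.01, threshold=25):
    """Return (binary motion map, updated running-average background)."""
    frame = frame.astype(np.float64)
    motion_map = np.abs(frame - background) > threshold
    background = (1.0 - alpha) * background + alpha * frame
    return motion_map, background
```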
2.2 Co-motion Statistics
For finding point correspondences between two images in the case of wide-baseline stereo video sequences, we have decided to analyze the dynamics of the scene. To do so, co-motion statistics (statistics of concurrent motions) were introduced. In the case of a single video sequence, a motion statistical map for a given pixel can be recorded as follows: when motion is detected at a pixel, the coordinates of all pixels where motion is also detected at that moment are recorded, and the values of the motion statistical map at the recorded coordinates are updated. Finally, this statistical map is normalized so that its global maximum equals 1. In the case of stereo video sequences, two motion-statistical maps are assigned to each point in the images: a local one and a remote one. The local map is the motion-statistical map within the image from which the pixel is selected, while the remote map refers to the motions in the other image. When motion is detected on the local side, the local statistical map of each point in the local motion map is updated with the local motion map, and the corresponding remote statistical map is updated with the motion map of the remote side. Examples of co-motion statistics are given in Fig. 1.
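As an illustration only (the data layout is our assumption; in practice the maps are kept at a reduced resolution, as noted in the results section), the accumulation step can be sketched as:

```python
import numpy as np

def update_comotion(motion_a, motion_b, local_stats, remote_stats):
    """Accumulate co-motion statistics for one synchronized frame pair.

    motion_a, motion_b : boolean motion maps of the local and remote views.
    local_stats, remote_stats : float arrays of shape (H, W, H, W) holding,
        for every local pixel, a 2-D accumulator image (kept at a reduced
        resolution in practice to limit memory use).
    """
    ys, xs = np.nonzero(motion_a)
    for y, x in zip(ys, xs):
        local_stats[y, x] += motion_a   # concurrent motion in the same view
        remote_stats[y, x] += motion_b  # concurrent motion in the other view
```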
Fig. 1. Remote statistical maps for two different cases. Left: for a point which is not in the cameras’ common field of view; right: for a point in the cameras’ common field of view.
2.3 Outlier Rejection
As candidate matches we choose the global maxima of the local and remote statistical images. For the rejection of outliers from the set of point correspondences we apply the principle of “good neighbors” and analyze the errors. The principle of “good neighbors” says that if we have a good match, then we will have many other good matches in some neighborhood of it. Consider a candidate match formed by a point in the first image and a point in the second image, and consider the neighborhoods of these two points. If the candidate is a good match, we expect to see many other candidate matches whose points fall into these two neighborhoods. Candidate pairs for which fewer other candidate pairs can be found in their neighborhoods are therefore eliminated.
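A minimal sketch of this neighborhood-support test; the radius and minimum support are illustrative parameters, not the authors' values.

```python
import numpy as np

def good_neighbor_filter(pairs, radius=20.0, min_support=3):
    """pairs: list of ((x1, y1), (x2, y2)) candidate correspondences."""
    pts1 = np.array([p for p, _ in pairs], dtype=float)
    pts2 = np.array([q for _, q in pairs], dtype=float)
    kept = []
    for i in range(len(pairs)):
        near1 = np.linalg.norm(pts1 - pts1[i], axis=1) < radius
        near2 = np.linalg.norm(pts2 - pts2[i], axis=1) < radius
        support = np.count_nonzero(near1 & near2) - 1  # exclude the pair itself
        if support >= min_support:
            kept.append(pairs[i])
    return kept
```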
Fig. 2. The global maximum is the red circle, while the correct correspondence should be at the blue circle.
The reduced set of point correspondences still contains erroneous matches, due to errors in the recording of co-motion statistics:
1. From the global statistical map we know the image regions where many more moving objects are detected than elsewhere. If a point in the first image corresponds to a scene location that is not in the field of view of the second camera, then the corresponding maximum in the remote statistical image will be in a wrong place, at a point where the value of the global motion statistics is high, see Fig. 2.
2. Because of the size of the moving objects, the global maxima can be shifted, ending up somewhere in the neighborhood of the desired corresponding point. This shifting results in cases where different points from the local statistical images are “mapped” onto the same point in the remote statistical images.
To solve the first problem, we eliminate points from the set of candidate matches if the global maximum of the remote statistical image falls on a pixel where the value of the global motion statistics is greater than some predefined parameter. To overcome the second problem, we also eliminate points from the set of candidate matches if the global maximum of the remote statistical image falls on a pixel that is already present in another candidate pair.
2.4 Fine-Tuning of Point Correspondences
The above-described outlier rejection algorithm results in point correspondences, but these must be fine-tuned for the alignment of the two views. For the alignment of the two images, a transformation is estimated from the extracted point correspondences. The result of the transformation can be seen in Fig. 3.
Fig. 3. The upper image shows the view of the left camera; below it, the transformed view of the right camera.
It can be seen that the resulting transformation is not the desired one: the continuous edges are broken when a composite view is generated from the transformed images. The point coordinates can contain errors; they can be shifted by some pixels, due to the nature of the co-motion statistics recording. Even with a 1-pixel error in the point coordinates, fine alignment of the images cannot be achieved. This simple outlier rejection algorithm must therefore be followed by a robust optimization to fine-tune the point correspondences and obtain subpixel accuracy. An iterative technique is used to refine both the point placements and the transformation. The method used is the Levenberg-Marquardt iteration [12], which minimizes the sum-of-squares difference between the obtained coordinates and the transformed values. The entries of the transformation matrix as well as the coordinates of the points in the right camera’s image are treated as variable parameters in the optimization, while the point coordinates of the left camera’s image are kept constant. The initial condition for this iteration is given by the entries of the transformation matrix and the point coordinates estimated by the above-described outlier rejection algorithm.
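A hedged sketch of this refinement, assuming a 3 × 3 projective transformation, at least four correspondences, and a residual that also keeps the refined right-image points near their measured positions (the paper does not spell out its exact parameterisation):

```python
import numpy as np
from scipy.optimize import least_squares

def refine(H0, pts_left, pts_right0):
    """Levenberg-Marquardt refinement of the transformation and the
    right-image point coordinates; left-image points are kept fixed."""
    pts_left = np.asarray(pts_left, dtype=float)
    pts_right0 = np.asarray(pts_right0, dtype=float)
    n = len(pts_left)

    def residuals(params):
        H = np.append(params[:8], 1.0).reshape(3, 3)
        pts_right = params[8:].reshape(n, 2)
        homog = np.hstack([pts_right, np.ones((n, 1))])
        proj = homog @ H.T
        proj = proj[:, :2] / proj[:, 2:3]
        # Reprojection error plus a term keeping the refined right-image
        # points close to their measured positions (our design choice).
        return np.concatenate([(proj - pts_left).ravel(),
                               (pts_right - pts_right0).ravel()])

    x0 = np.concatenate([H0.ravel()[:8], pts_right0.ravel()])
    sol = least_squares(residuals, x0, method='lm')
    H = np.append(sol.x[:8], 1.0).reshape(3, 3)
    return H, sol.x[8:].reshape(n, 2)
```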
3 Time Synchronization
Until now, we have assumed that the cameras’ clocks are synchronized. Many algorithms have been developed for time synchronization, e.g. the Berkeley algorithm.
In our case, if the cameras are not synchronized, then the generated co-motion statistics no longer refer to concurrent motions detected in the two stereo sequences. So, when we apply our algorithm for outlier rejection, we do not get a “large” set of point correspondences, whereas many more point correspondences can be extracted in the case of synchronized sequences.
Fig. 4. Cardinality of the set of point correspondences for different time offset values. The maximum is at 100 frames, which means that the offset between the two sequences is 100 frames.
Based on this observation, which holds in practice, we calculate point correspondences for different time offset values and then perform a one-dimensional search for the largest set of point correspondences to synchronize the sequences, see Fig. 4.
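A sketch of this one-dimensional search; the helper `match_with_offset` stands for the whole statistics and outlier-rejection chain, and the offset range is illustrative.

```python
def synchronize(seq_a, seq_b, match_with_offset, max_offset=200):
    """Return the frame offset that yields the most point correspondences."""
    best_offset, best_count = 0, -1
    for offset in range(-max_offset, max_offset + 1):
        correspondences = match_with_offset(seq_a, seq_b, offset)
        if len(correspondences) > best_count:
            best_offset, best_count = offset, len(correspondences)
    return best_offset
```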
Fig. 5. Change of the error rate for different offset values. The minimum is at 100 frames, matching the maximum of the cardinalities of the sets of point correspondences.
It can be seen in Fig. 4 that even in the case of unsynchronized sequences the algorithm produces point correspondences. However, if we analyze the sum-of-squares difference score (the reprojection error in this case), see Fig. 5, we find that the global minimum is at an offset value of 100 frames, coinciding with the maximum in Fig. 4 for the cardinalities of the sets of point correspondences. This means that the global optimum is at an offset value of 100 frames; in all other cases the obtained point correspondences correspond only to local optima.
5 Results
The above-described approach was tested on videos captured by two cameras, having partially overlapping views, at Gellert (GELLERT videos) and Ferenciek (FERENCIEK videos) squares in Budapest. The GELLERT videos are captured at resolution
160×120, at the same zoom level and with the same cameras, while the FERENCIEK videos are captured at resolution 320×240, at different zoom levels and with different cameras. The common field of view of the two cameras is in both cases about 30%. The proposed outlier rejection algorithm rejects most (98%) of the candidate point pairs. For the GELLERT videos it results in 49 point correspondences and for the FERENCIEK videos in 23, which are still enough to estimate the common groundplanes. The computation time of the whole statistical procedure was about 10 minutes for the 10 minutes of video presented in the figures. For longer sequences and higher resolutions we apply a two-step procedure: the statistical maps are generated at a resolution of 80×60, and then, based on them, the fine-tuning of the point correspondences is done at the video’s native resolution.
Fig. 6. The constructed composite views. The upper one is generated from the GELLERT videos; the lower one from the FERENCIEK videos.
6 Conclusions
The paper has shown that for freely placed outdoor cameras the common groundplane can be estimated without human interaction for arbitrary scenes. In our approach no a priori information is needed, and the method also works well on images of randomly scrambled motion, where other methods fail because of the missing fixed structures. We introduced co-motion statistics to find matching points in image pairs. We first record motion statistics and then choose the global maxima as candidate matches. This step is followed by the elimination of outliers from the set of candidate matches and by an optimization based on the minimization of the reprojection error between the images, to fine-tune the locations of the candidate pairs.
Acknowledgements. The authors would like to acknowledge the support received from the Hungarian National Research and Development Program, TeleSense project grant (NKFP) 035/02/2001.
References
1. O. D. Faugeras, Q.-T. Luong, S. J. Maybank: Camera self-calibration: Theory and experiments. ECCV ’92, Lecture Notes in Computer Science, Vol. 588, Springer-Verlag, Berlin Heidelberg New York (1992) 321-334
2. R. Hartley: Estimation of relative camera positions for uncalibrated cameras. Proc. of ECCV’92, Lecture Notes in Computer Science, Vol. 588, Springer-Verlag, Berlin Heidelberg New York (1992)
3. D. H. Ballard, C. M. Brown: Computer Vision. Prentice-Hall, Englewood Cliffs NJ (1982)
4. S. T. Barnard, W. B. Thompson: Disparity analysis of images. IEEE Trans. PAMI, Vol. 2(4) (1980) 333-340
5. J. K. Cheng, T. S. Huang: Image registration by matching relational structures. Pattern Recog., Vol. 17(1) (1984) 149-159
6. J. Weng, N. Ahuja, T. S. Huang: Matching two perspective views. IEEE Trans. PAMI, Vol. 14(8) (1992) 806-825
7. Z. Zhang, R. Deriche, O. Faugeras, Q.-T. Luong: A robust technique for matching two uncalibrated images through the recovery of the unknown epipolar geometry. Artificial Intelligence Journal, Vol. 78 (1995) 87-119
8. H. C. Longuet-Higgins: A computer algorithm for reconstructing a scene from two projections. Nature, Vol. 293 (1981)
9. L. Lee, R. Romano, G. Stein: Monitoring activities from multiple video streams: establishing a common coordinate frame. IEEE Trans. PAMI, Vol. 22(8) (2000)
10. Y. Caspi, D. Simakov, M. Irani: Feature-based sequence-to-sequence matching (2002)
11. J. Canny: A computational approach to edge detection. IEEE Trans. on Pattern Anal. and Mach. Intell., Vol. 8(6) (1986) 679-698
12. W. H. Press, B. P. Flannery, S. A. Teukolsky, W. T. Vetterling: Numerical Recipes: The Art of Scientific Computing. Cambridge University Press, Cambridge (1986)
An Adaptive Estimation Method for Rigid Motion Parameters of 2D Curves
Turker Sahin and Mustafa Unel
Department of Computer Engineering, Gebze Institute of Technology, Cayirova Campus, 41400 Gebze/Kocaeli, Turkey
{htsahin,munel}@bilmuh.gyte.edu.tr
Abstract. A new method is presented for identifying the rigid motion of free-form curves based on “related-points” extracted from the decomposition of the implicit polynomials of these curves. Polynomial decomposition expresses the curve as a unique sum of products of (possibly) complex lines. We show that each real intersection point of these lines, i.e. each related-point, undergoes the same motion as the curve, and therefore such points can be used for identifying the motion parameters of the curve. The resulting tuning algorithm is verified by experiments.
1 Introduction
Algebraic curves have proven very useful in many model-based applications over the past decades. These implicit models have been used widely for important computer vision tasks such as single-computation pose estimation, shape tracking, 3D surface estimation and indexing into large pictorial databases [1,2,3,4,5,6,7,8,9]. In this paper we are interested in identifying the rigid motion of two-dimensional planar algebraic curves. For some references on the dynamics of curves, see [10,11,12]. We will use a unique decomposition of algebraic curves to obtain feature points for motion estimation. The decomposition represents such curves as a unique sum of products of (possibly) complex lines. The real intersection points of these lines are shown to be related-points, which undergo the same motion as the curve. For rigid motion, the equations which describe the related points are in the form of a continuous linear plant with unknown parameters. We develop an adaptive tuning algorithm for estimating these motion parameters. Starting with random values, the parameters are updated in an adaptive fashion. Convergence of the estimation error is established by a Lyapunov analysis.
2 2D Curves and Their Implicit Representations
2D curves can be modelled by implicit algebraic equations of the form f(x, y) = 0, where f(x, y) is a polynomial in the variables x and y, i.e. a finite sum of monomials whose coefficients are real numbers [1].
Fig. 1. A group of 2D objects and their free-form 3L curve models
Algebraic curves of degree 1, 2, 3, 4, ... are called lines, conics, cubics, quartics, ... etc. Figure 1 depicts some objects used in our experiments with their outlines modelled by a curve fitting procedure detailed in [14]. In the following sections, we will focus on quartics for our analysis, which can be generalized to higher degree curves.
3 Decomposed Quartics and Related Points
3.1 Non-visual Line Factor Intersection Points
It has been shown in [3,4] that algebraic curves can be decomposed as a unique sum of line factors, the intersections of which are examples of related-points (an acronym for real equivalent locations that affine transformations equate directly). Considering an accordingly decomposed monic quartic curve, the intersection point of any two non-parallel line factors can be defined by a matrix/vector relation.
Fig. 2. A quartic boomerang-shaped curve decomposed into its (complex) line factors and non-visual intersection points
For closed and bounded quartics, the decomposition implies two pairs of complex-conjugate lines, the intersection points of which are real. Figure 2 depicts these non-visual points of a quartic curve along with the 6 complex lines from the decomposition of the curve.
4 Rigid and Affine Motion of Planar Curves in a Plane
4.1 Rigid and Affine Motion
An affine motion can be described by a linear transformation plus a translation. In the special case where the 2 × 2 matrix is skew-symmetric, the motion will be termed a rigid motion. Most practical motion types are virtually rigid, with very little or no change in object shape along the route, and this is the focus of this paper.
4.2 Affine Equivalence and Related Points
In general, any two curves of the same degree, defined by monic polynomials, will be affine equivalent if one polynomial equals the other under an affine change of coordinates, up to a scalar factor, where A represents the affine transformation.
Two corresponding related-points of the affine equivalent curves defined by and such as and respectively, will be defined by the condition that
In light of ( 4), any two corresponding related-points will satisfy the relation
4.3 Line Factor Transformations
Under an affine transformation A, every
in ( 1), namely
for a real scalar and monic line factors with or 2. Therefore, under an affine transformation A, the implicit polynomial defined by ( 1) will imply
a unique monic polynomial that is affine equivalent to
namely
Each polynomial and each corresponding polynomial of an affine equivalent curve will have the same number of line factors. Moreover, in light of (7), all of these factors will map to one another under affine transformations. Thus the two curves will have the same number of corresponding related-points, as defined by the intersections of their corresponding line factors. These related points can also be determined rather easily and precisely from an IP equation, and are therefore very suitable for the analysis of affine and rigid curve motion.
5 Identification of Rigid Motion Parameters
Equation ( 3) is a linear plant of the form
where is an unknown constant matrix. To estimate the unknown parameters, we can construct an estimator of the form [13]
If the state error and parameter errors are defined as
then the error equations are given by
where the matrix in the error dynamics is a stability matrix, i.e. its eigenvalues have negative real parts. The problem is to adjust the elements of the estimated parameter matrix, or equivalently the parameter errors, so that these quantities tend to zero as time increases. We choose the adaptive law to be
where P is a symmetric positive-definite matrix (P > 0), which satisfies the Lyapunov equation, namely
where Q is a positive-definite matrix (Q > 0). This law ensures the global stability of the overall system, with the output error tending to zero asymptotically. However, the state error is only asymptotically convergent to zero. The convergence of the parameters to their true values depends on the persistent excitation of the regressor matrix, which is guaranteed for rigid motion. This method can be summarized in the following algorithm (an illustrative sketch follows the list):
1. Initialize the estimator with a stable matrix, a positive-definite matrix and a random parameter vector.
2. Acquire the contour data of the object in motion at the sampling instants.
3. Fit a curve to the data using any Euclidean invariant fitting method.
4. Decompose the curve into its line factors using (1) and compute its related points using (2); the mean point of these related points is used as the measured state.
5. Solve the Lyapunov equation in (14) and use the resulting P matrix in the update law (13) to obtain the parameter estimates; then use them in (11) to update the state estimate.
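The following sketch is our own illustrative rendition of such a Lyapunov-based adaptive estimator, not the authors' implementation: the plant is assumed to be x_dot = Theta @ [x, y, 1] with an unknown 2 × 3 parameter matrix Theta, a simple Euler discretisation is used, and the default Am and Q values are arbitrary.

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

def estimate_rigid_motion(points, dt, Am=None, Q=None):
    """Adaptive estimation of the 2x3 parameter matrix Theta from a sequence
    of sampled (mean related-point) positions spaced `dt` seconds apart."""
    Am = np.diag([-5.0, -5.0]) if Am is None else Am          # stable matrix
    Q = np.eye(2) if Q is None else Q                         # Q > 0
    P = solve_continuous_lyapunov(Am.T, -Q)                   # Am.T P + P Am = -Q
    theta_hat = np.random.randn(2, 3)                         # random start
    x_hat = np.array(points[0], dtype=float)
    for k in range(len(points) - 1):
        x = np.asarray(points[k], dtype=float)
        phi = np.append(x, 1.0)                               # regressor [x, y, 1]
        e = x_hat - x                                         # state error
        x_hat_dot = Am @ e + theta_hat @ phi                  # estimator dynamics
        theta_hat = theta_hat - dt * (P @ np.outer(e, phi))   # adaptive law
        x_hat = x_hat + dt * x_hat_dot                        # Euler step
    return theta_hat
```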
Fig. 3. Racket data subjected to additive noise (left); the data undergoing rigid motion (middle); and the corresponding parameter estimates (right)
Fig. 4. Ericsson cellular phone data subjected to additive noise (left); the data undergoing rigid motion (middle); and the corresponding parameter estimates (right)
6 Experimental Results
For our experiments, object boundaries have been modelled by quartic curves. The related points of these curves are obtained from the decomposition of the curve as in (1), and their mean point is employed by the estimator model of Section 5 for the motion parameters. The stability matrix and Q are selected so as to enable quick settling times.
Accordingly, the Lyapunov matrix P is calculated for the update law (13). Since the decomposition is essentially noisy, it is important to assess the accuracy of the parameter estimates for noisy data. Two such examples are presented. The first is a racket, shown in Figure 3, which was perturbed by additive noise while undergoing rigid motion with known
parameters. Despite the noise level, the adaptive estimation technique gives good parameter estimates with quick convergence. The second example is an Ericsson phone, shown in Figure 4, undergoing rigid motion with additive Gaussian noise. Although the data is prone to deformation, with a narrow antenna section, the estimates are again accurate, which emphasizes the robustness of this method.
7 Summary and Conclusion
The main message of this paper is that rigid motion parameters of curves defined by implicit polynomial equations can be estimated using their related-points. The proposed tuning algorithm uses three related points as state feedback signals to directly determine the motion parameters of a moving curve. Experiments with noisy data have been conducted, which verify the robustness of the adaptive tuning algorithm.
Acknowledgments. This research was supported by GYTE research grant BAP #2003A23.
References
1. C.G. Gibson, “Elementary geometry of algebraic curves”, Cambridge University Press, Cambridge, UK, 1998.
2. D. Keren et al., “Fitting curves and surfaces to data using constrained implicit polynomials,” IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 23, No. 1, January 1999.
3. M. Unel, W. A. Wolovich, “On the construction of complete sets of geometric invariants for algebraic curves,” Advances in Applied Mathematics, Vol. 24, No. 1, pp. 65-87, January 2000.
4. M. Unel, W. A. Wolovich, “A new representation for quartic curves and complete sets of geometric invariants,” International Journal of Pattern Recognition and Artificial Intelligence, December 1999.
5. J. L. Mundy, Andrew Zisserman, “Geometric invariance in computer vision,” The MIT Press, 1992.
6. G. Taubin, D. B. Cooper, “2D and 3D object recognition and positioning with algebraic invariants and covariants,” Chapter 6 of Symbolic and Numerical Computation for Artificial Intelligence, Academic Press, 1992.
7. G. Taubin, F. Cukierman, S. Sullivan, J. Ponce and D.J. Kriegman, “Parameterized families of polynomials for bounded algebraic curve and surface fitting,” IEEE PAMI, March 1994.
8. W. A. Wolovich, Mustafa Unel, “The determination of implicit polynomial canonical curves,” IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 20, No. 10, pp. 1080-1089, October 1998.
9. W. A. Wolovich, Mustafa Unel, “Vision based system identification and state estimation,” The Confluence of Vision and Control, Springer Lecture Notes in Control and Information Sciences, No. 237, pp. 171-182, 1998.
10. R. Cipolla and A. Blake, “Surface shape from the deformation of apparent contours,” Internat. J. Comput. Vision, vol. 9, no. 2, 1992, pp. 83-112.
11. O. D. Faugeras, “On the motion of 3-D curves and its relationship to optical flow,” in: O.D. Faugeras, ed., Proc. 1st ECCV (Springer, Berlin, 1990), pp. 107-117.
12. O. D. Faugeras and T. Papadopoulo, “A theory of the motion fields of curves,” Internat. J. Comput. Vision, vol. 10, no. 2, pp. 125-156, 1993.
13. K. S. Narendra, A. M. Annaswamy, “Stable Adaptive Systems,” Prentice-Hall, Inc., 1989.
14. Z. Lei, M. M. Blane and D. B. Cooper, “3L Fitting of Higher Degree Implicit Polynomials,” In proceedings of Third IEEE Workshop on Applications of Computer Vision, pp. 148-153, Florida, 1996.
Classifiers Combination for Improved Motion Segmentation
Ahmad Al-Mazeed, Mark Nixon, and Steve Gunn
University of Southampton, Southampton, SO17 1BJ, UK
{aha01r,msn,srg}@ecs.soton.ac.uk
Abstract. Multiple classifiers have shown the capability to improve performance in pattern recognition. This process can improve the overall accuracy of a system by using an optimal decision criterion. In this paper we propose an approach using a weighted benevolent fusion strategy to combine two state-of-the-art pixel-based motion classifiers. Tests on outdoor and indoor sequences confirm the efficacy of this approach. The new algorithm can successfully identify and remove shadows and highlights with improved moving-object segmentation. A process to optimise shadow removal is introduced to remove shadows and distinguish them from motion pixels. A particular advantage of our evaluation is that it is the first approach that compares foreground/background labelling with results obtained from ground-truth labelling.
1 Introduction
The objective of achieving the best-performing pattern recognition classifiers leads to different designs of high-performance algorithms. Classifiers differ in their classification decisions, suggesting that different classifier designs potentially offer complementary information about the patterns to be classified, which can be harnessed to improve the performance of the selected classifier [1]. In this paper two motion classifiers are combined using Bayes’ theorem, while considering the confidence of each classifier, to optimise the motion classification process.
2 Motion Detection
The detection of moving objects is an essential part of information extraction in many computer vision applications, including surveillance and video coding. Background differencing is a well-established basis for moving object extraction. In more refined approaches, statistical methods are used to form the background model. Horprasert et al. [2] introduced a new computational colour model which separates the brightness from the chromaticity component. The algorithm can detect moving objects and can distinguish shadows from the background. Pfinder [3] uses a multiscale statistical model of colour and shape with a single Gaussian per pixel to model the background. It succeeded in finding a
2-D representation of the head, hands and feet locations of a moving human subject. In contrast, Friedman and Russell [4] took a simpler approach to modelling the statistical nature of the image by using a single distribution to model the whole of the background and two other distributions to model the variability in shadows and moving objects. Elgammal et al. [5] used a Gaussian density estimator as a kernel in the process of background modelling. The final background model is updated by combining a short-term and a long-term model of the background. Often multiple surfaces appear at a particular background pixel and the lighting conditions change. Therefore, to robustly model a multi-modal background, multiple adaptive Gaussians can be used. In addition, a mixture of Gaussians model is a very appealing approach to data fitting as it scales favourably with the dimensionality of the data, has good analytic properties, and many data sets form clusters which are approximately Gaussian in nature [6]. Stauffer and Grimson [7] presented an online algorithm based on a statistical method using a mixture of Gaussians. The persistence and the variance of each of the Gaussians are used to identify background distributions. The approach was designed to deal robustly with multimodal backgrounds, lighting changes, and repetitive motions of scene elements. The method lacks the capability to remove shadows and highlights. This method was further extended using an EM algorithm in [8] to track motion and in [9] to track faces. The method was also used with image mosaicing techniques to build panoramic representations of the scene background [10]. Magee [11] used a projective ground-plane transform within the foreground model to strengthen object size and velocity consistency assumptions with the mixture of Gaussians background modelling method. Such techniques form a good basis for building a better approach. Skillful combination of such methods, keeping the strong points and removing the weaknesses, can eventually result in a better technique. In the following sections we describe two standard pixel-based motion extraction approaches, one based on a mixture of Gaussians [7] and another based on the statistical properties of the colour model [2]. These are combined in Sect. 4. The segmentation analysis is given in Sect. 5. A further comparison of outdoor vs. indoor extraction in Sect. 5 confirms the efficacy of this approach, prior to suggestions of future avenues for research.
3 Motion Extraction
3.1 Mixture of Gaussians Algorithm (MOG)
This approach models the background with independent distributions that are updated on-line. The recent history of each pixel is modelled as a mixture of K Gaussian distributions. The probability of a pixel intensity X_t is P(X_t) = sum_{i=1..K} w_{i,t} · η(X_t, μ_{i,t}, Σ_{i,t}),
where K is the number of distributions, w_{i,t} is the weight estimate of the i-th distribution, μ_{i,t} is its mean value, and Σ_{i,t} is its covariance matrix;
η is a Gaussian probability density function, namely the multivariate Gaussian density,
where n is the input dimension, which is 3 for the (RGB) colour model, and the covariance matrix is approximated by Σ_{k,t} = σ_k² I. Every new pixel value X_t is compared to the existing K Gaussian distributions. The pixel is classified as belonging to a particular distribution if it lies within 2.5 times the standard deviation of that distribution. The pixel is checked against the background distributions first and then against the foreground distributions. The distributions are ordered according to the ratio of the weight over the standard deviation of each distribution, w/σ. This process ranks them from the most probable background distributions (those with high weight and low variance) to the least probable (those with low weight and high variance). The background model is formed from the first background distributions whose accumulated weight exceeds a threshold T,
where T controls the number of modes of variation in the background. If a pixel does not match any of the K distributions, it is considered as a new distribution, replacing the distribution with the smallest w/σ ratio. The mean of the new distribution is set to the pixel value, its prior weight to a low value and its variance to a high value. After evaluating a new pixel, the prior weights of the K distributions are updated at time t as w_{k,t} = (1 − α) w_{k,t−1} + α M_{k,t},
where α is the learning rate and M_{k,t} is 1 for the matching distribution and 0 for the remaining distributions. The weights are normalised after this process. The values of the mean and variance are updated only for the matching distribution,
with a second learning rate ρ. If a non-background pixel (part of a moving object) does not move over a period of time, its distribution weight will increase over time and its variance will decrease until this pixel becomes part of the background model.
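As a rough illustration of this per-pixel procedure (our simplification, not the paper's exact update rules: a single scalar variance per distribution and one learning rate are assumed, and the initial values for new distributions are illustrative):

```python
import numpy as np

def update_pixel(pixel, means, variances, weights, alpha=0.05, T=0.4):
    """One simplified MOG update for a single RGB pixel.

    means: (K, 3) array, variances: (K,) array, weights: (K,) array.
    Returns True if the pixel is classified as foreground."""
    pixel = np.asarray(pixel, dtype=float)
    K = len(weights)
    d2 = np.sum((means - pixel) ** 2, axis=1)
    match = int(np.argmin(d2))
    if d2[match] > (2.5 ** 2) * variances[match]:
        # No match: replace the least probable distribution with a new one.
        worst = int(np.argmin(weights / np.sqrt(variances)))
        means[worst], variances[worst], weights[worst] = pixel, 900.0, 0.05
        match = None
    owner = np.zeros(K)
    if match is not None:
        owner[match] = 1.0
        means[match] += alpha * (pixel - means[match])
        variances[match] += alpha * (np.sum((pixel - means[match]) ** 2)
                                     - variances[match])
    weights[:] = (1 - alpha) * weights + alpha * owner
    weights /= weights.sum()
    # Background = first B distributions (ordered by w/sigma) summing to T.
    order = np.argsort(-(weights / np.sqrt(variances)))
    B = np.searchsorted(np.cumsum(weights[order]), T) + 1
    background = set(order[:B])
    return match is None or match not in background
```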
3.2 Statistical Background Disturbance Technique (SBD)
This algorithm decomposes the colour space, based on a statistical computational model, to separate the chromaticity from the brightness component. The algorithm initially uses N frames to form the background model. From these frames, the mean and the variance are computed for each colour band (RGB) at each pixel. The chrominance distortion, CD, and the brightness distortion between the background model and a new pixel are computed as follows,
where the per-band mean and standard deviation of each background pixel are used. The normalised chrominance distortion and the normalised brightness distortion are then used to classify the new pixel:
where FG, BG, S and G denote foreground, background, shadow and highlight, respectively. Two thresholds are used to specify the borders of the foreground, and two further thresholds are used to identify the borders of the background. These thresholds are determined automatically through a statistical learning procedure [2]. During the background building process a histogram is constructed for the normalised distortions. The thresholds are then computed after fixing a detection rate, which fixes the expected proportions of the image contents.
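A hedged sketch of this colour model for a single pixel; the variance-based normalisation factors of [2] are folded into per-band weights, and the threshold names are our assumptions:

```python
import numpy as np

def classify_sbd(pixel, mean, std, tau_cd, tau_lo, tau_hi):
    """Classify one pixel as 'FG', 'BG', 'S' (shadow) or 'G' (highlight).

    mean, std : length-3 background statistics for the pixel's RGB bands.
    tau_cd    : chrominance-distortion threshold (foreground border).
    tau_lo, tau_hi : brightness-distortion thresholds (background border)."""
    pixel = np.asarray(pixel, dtype=float)
    w = 1.0 / np.maximum(std, 1e-6)               # weight bands by reliability
    # Brightness distortion: scaling of the background colour that best
    # matches the observed pixel.
    alpha = np.sum(pixel * mean * w ** 2) / np.sum((mean * w) ** 2)
    # Chrominance distortion: residual orthogonal to the brightness axis.
    cd = np.sqrt(np.sum(((pixel - alpha * mean) * w) ** 2))
    if cd > tau_cd:
        return 'FG'
    if tau_lo <= alpha <= tau_hi:
        return 'BG'
    return 'S' if alpha < tau_lo else 'G'
```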
4 Combining Motion Classifiers
The combination of the two classification algorithms was motivated by the classification performance of both algorithms and by the shadow extraction feature provided by the SBD algorithm. The fact that both algorithms operate using pixel-wise operations facilitates the combination. The two classification algorithms are combined using Bayes’ theorem.
Whenever the classifiers agree on a certain decision (whether a pixel is a foreground pixel or a background pixel), the final decision is set to that decision. On the other hand, if the classifiers disagree, then the conditional probability for
the class chosen by each classifier is calculated. The conditional probability for the Statistical Background Disturbance technique, for a pixel being part of the background class, is calculated as follows,
where D is the distance between the tested pixel and the mean of the background distribution, and Var is the background variance. The Mixture of Gaussians algorithm provides the conditional probability for the background directly. The foreground conditional probability for the MOG algorithm, Eqn. (12), is calculated from the closest background distribution for the pixel.
For the SBD algorithm, the conditional probability for the foreground is approximated by a corresponding expression.
The decision is then made according to Eqn. (14),
where the class is either background (BG) or foreground (FG) for each classifier. In Eqn. (14) the maximum conditional probability of each classifier is used together with the classifier’s confidence measure to find the final decision of the algorithm. The confidence measures satisfy a sum-to-unity condition.
The priors (confidence weights) are calculated using a training set of N frames. In the training process, an exhaustive search is performed by changing the weights incrementally between zero and one until an optimal value is reached that gives the minimum classification error. The shadows are removed using the detection criteria of the Statistical Background Disturbance algorithm only, since there is no such feature in the Mixture of Gaussians algorithm. To optimise this process, a threshold distance between the background mean and a virtual border for the shadow class is determined (using the same process as for the priors). Any shadow pixel with a distance exceeding the shadow border is considered as a motion pixel.
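The decision logic can be sketched as follows; as an illustrative simplification the foreground probabilities are taken as the complements of the background probabilities, whereas the paper defines them separately.

```python
def combine(decision_mog, decision_sbd, p_bg_mog, p_bg_sbd, w_mog, w_sbd):
    """Weighted benevolent fusion of the two pixel classifiers (sketch).

    p_bg_* : conditional background probabilities from each classifier.
    w_mog, w_sbd : confidence weights learned on a training set (sum to 1)."""
    if decision_mog == decision_sbd:
        return decision_mog
    # Disagreement: each classifier votes with its strongest class.
    score_mog = max(p_bg_mog, 1.0 - p_bg_mog)
    score_sbd = max(p_bg_sbd, 1.0 - p_bg_sbd)
    vote_mog = 'BG' if p_bg_mog >= 0.5 else 'FG'
    vote_sbd = 'BG' if p_bg_sbd >= 0.5 else 'FG'
    return vote_mog if w_mog * score_mog >= w_sbd * score_sbd else vote_sbd
```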
5 Experimental Results
The presented algorithms were tested on indoor and outdoor sequences of walking human subjects. In testing the algorithms we used outdoor sequences of size 220 × 220 pixels and indoor sequences of size 720 × 367 pixels, with 77-81 images
Fig. 1. Comparing the tested algorithms.
per sequence. The resulting extracted motion frames for the indoor sequences are compared with the silhouettes provided by the University of Southampton database [12]. The silhouettes were generated by chroma-key extraction of the green background. The total difference between the algorithm’s extractions and the silhouettes is calculated for each image as a count of the mismatching pixels. This facilitates the comparison of the extraction procedure with a form of ground truth. The Statistical Background Disturbance algorithm and the Mixture of Gaussians algorithm were trained initially with a background sequence of 50 frames. The MOG was used with 5 distributions per pixel. Each new distribution created was set to an initial weight of 0.05 and an initial variance equal to the largest variance of all the background pixels for the indoor sequences (double the background variance for the outdoor ones). The frames were tested with a background threshold (T) of 0.4 (0.6 for outdoor) and a learning rate of 0.05. The combined algorithm was tested on ten indoor sequences. All the combined algorithm results were better than those of both the Mixture of Gaussians (with 5 distributions) and the Statistical Background Disturbance techniques. The mean errors for the extraction of the 10 indoor sequences are shown in Fig. 1. The chart values are produced by finding the percentage of all misclassified pixels (comparing the current extraction with the silhouette). The performance measure used to evaluate each method is based on this percentage of misclassified pixels.
By this performance measure, the combined algorithm shows an improvement of more than 10% over the MOG algorithm and more than 3% over the SBD
Fig. 2. Two examples of indoor images extracted with the tested algorithms.
algorithm. Table (1) shows the overall performance of the tested algorithm on all the indoor sequences.
Samples of extracted indoor sequences are shown in Fig. 2. The samples were chosen so as to show the performance advantage of the new algorithm over the MOG and the SBD algorithms. The output image of the Mixture of Gaussians algorithm gives a fine motion extraction, but with noise in the background and shadows accompanying the moving object. Some of the output images produced by the SBD algorithm have holes in the moving object (the holes vary in size). The extraction by the SBD (Fig. 2c) has misclassified small parts of the legs, though with less shadow and a cleaner background. The best result is given by the output image of the combined algorithm, with a clean background and a finely extracted moving object (in some of the extracted sequences small parts of the shadow still persist). For outdoor sequences, since the environment is more complex, it is possible to have more pixels mistakenly labelled by the combined algorithm. The combined algorithm can nevertheless improve the outdoor motion extraction, as shown in Fig. 3.
Fig. 3. Two examples of outdoor images using the three extraction algorithms.
6 Conclusions
This paper presents a new motion extraction algorithm obtained by combining two motion classifiers. A comparison between the new algorithm and the original classifiers was prepared using controlled laboratory data and outdoor data. The combined algorithm shows that the combination of pixel-based motion segmentation algorithms can improve segmentation performance. This suggests that applying more advanced ensemble methods could provide further performance improvements.
References
1. Kittler, J., Hatef, M., Duin, R., Matas, J.: On combining classifiers. IEEE TPAMI 20 (1998) 226–239
2. Horprasert, T., Harwood, D., Davis, L.: A statistical approach for real-time robust background subtraction and shadow detection. In: Proc. ICCV’99. (1999) 1–19
3. Wren, C., Azarbayejani, A., Darrell, T., Pentland, A.: Pfinder: Real-time tracking of the human body. IEEE TPAMI 19 (1997) 780–785
4. Friedman, N., Russell, S.: Image segmentation in video sequences: a probabilistic approach. In: Proc. UAI97. (1997) 175–181
5. Elgammal, A., Duraiswami, R., Harwood, D., Davis, L.S.: Background and foreground modeling using non-parametric kernel density estimation for visual surveillance. Proceedings of the IEEE 90 (2002) 1151–1163
6. Roberts, S., Husmeier, D., Rezek, I., Penny, W.: Bayesian approaches to Gaussian mixture modeling. IEEE TPAMI 20 (1998) 1133–1142
7. Stauffer, C., Grimson, W.: Learning patterns of activity using real-time tracking. IEEE TPAMI 22 (2000) 747–757
8. KaewTraKulPong, P., Bowden, R.: An improved adaptive background mixture model for real-time tracking with shadow detection. In: Proc. AVBS’01. (2001)
9. McKenna, S.J., Raja, Y., Gong, S.: Tracking colour objects using adaptive mixture models. Image and Vision Computing 17 (1999) 225–231
10. Mittal, A., Huttenlocher, D.: Scene modeling for wide area surveillance and image synthesis. In: Proc. CVPR’2000. Volume 2. (2000) 160–167
11. Magee, D.R.: Tracking multiple vehicles using foreground, background and motion models. Image and Vision Computing 22 (2004) 143–155
12. Shutler, J., Grant, M., Nixon, M., Carter, J.: On a large sequence-based human gait database. In: Proc. of RASC 2002. (2002) 66–71
A Pipelined Real-Time Optical Flow Algorithm
Miguel V. Correia*,1,2 and Aurélio Campilho1,2
1 Instituto de Engenharia Biomédica, Laboratório de Sinal e Imagem Biomédica
2 Universidade do Porto, Fac. Engenharia, Dept. Eng. Electrotécnica e Computadores, Rua Dr. Roberto Frias, s/n, 4200–465 Porto, Portugal
{mcorreia,campilho}@fe.up.pt
Abstract. Optical flow algorithms generally demand high computational power and huge storage capacities. This paper is a contribution to the real-time implementation of an optical flow algorithm on a pipeline machine. The overall optical flow computation methodology is presented and evaluated on a set of synthetic and real image sequences. Results are compared to other implementations using as measures the average angular error, the optical flow density and the root mean square error. The proposed implementation achieves very low computation delays, allowing operation at standard video frame-rate and resolution. It compares favorably to recent implementations on standard microprocessors and on parallel hardware.
Keywords: optical flow, real-time, motion analysis, pipeline hardware
1 Introduction
Processing of visual motion information is an important and challenging task in machine vision systems, because it may provide unique information about world motion and three-dimensional structure. Psychophysical and computational studies have established a direct relationship between the projected retinal optical velocity field and depth and motion [1]. This optical flow field provides information on the spatial distribution of the temporal evolution of image intensity. Biologically inspired visual motion detection [2], motion segmentation to distinguish between self-motion and the motion of different objects, time-to-impact measurement [3] and obstacle avoidance [4] are examples where optical flow computation can be used. Optical flow algorithms generally demand high computational power and huge storage capacities. Recent implementations report processing times in the order of several seconds per frame [5,6,7] for image sequences of moderate spatial resolution. Liu et al. in [5] identify three different types of hardware systems for real-time computation of optical flow: parallel computers, such as the Connection Machine or the Parsytec Transputer systems; special image processing hardware, such as PIPE or Datacube; and dedicated vision or non-vision VLSI chips.
Supported by grant BD/3250/94 from Fundação para a Ciência e Tecnologia, Portugal
The authors also raised the issue of accuracy-efficiency trade-offs and gave a detailed comparative analysis of gradient and correlation algorithms. Farnebäck [6] implemented a fast motion estimation algorithm based on orientation tensors and parametric models for a general-purpose processor. Fleury et al. [7] used a general-purpose parallel machine to implement gradient, correlation and phase-based methods for optical flow computation. They compared the performance of the architecture on the three methods and analysed the results. In the authors’ opinion the obtained speed-ups justified the parallelisation. Recently, we have applied optical flow computation as a tool to help characterize visual motion in studies of human visual perception [8]. It was found that, in order to be useful, optical flow must be computed with short processing time spans. The objective of the work presented in this paper was to develop an implementation of optical flow computation that provides a dense characterization of visual motion from image sequences at video frame-rates [9]. Section 2 discusses the issue of optical flow computation. In Section 3, we propose and present the method and its implementation for real-time optical flow computation on a pipeline image processor. Results are presented and discussed in Section 4. The main conclusions and directions for future work are presented in Section 5.
2
Computation of Optical Flow
The study of Barron et al. [10] on the performance of methods for optical flow computation shows several results of correlation, gradient, energy and phase-based methods on different kinds of image sequences. Also, according to the work of Simoncelli [11], optical flow can be determined through the implementation of spatial-temporal differential filters. A careful evaluation of different methods therefore led us to adopt the method of Lucas and Kanade [12] to compute optical flow, due to its low computational complexity and rather good accuracy, as stated in Barron et al. [10]. In this gradient-based method, velocity is computed from first-order derivatives of image brightness, using the well known motion constraint equation:

$\nabla I(\mathbf{x}, t) \cdot \mathbf{v} + I_t(\mathbf{x}, t) = 0 \quad (1)$

where $\nabla I$ denotes the spatial gradient of image intensity and $I_t$ is its partial temporal derivative. In order to obtain the two components of the velocity vector we impose a constant parametric model on the velocity field and assume spatial and temporal coherence locally, as in [12]. The optical flow on each pixel is obtained by computing [13]:

$\mathbf{v} = (A^T W^2 A)^{-1} A^T W^2 \mathbf{b} \quad (2)$

where, for $n$ points $\mathbf{x}_i$ in a neighbourhood, at a single instant $t$,

$A = [\nabla I(\mathbf{x}_1), \ldots, \nabla I(\mathbf{x}_n)]^T, \qquad \mathbf{b} = -(I_t(\mathbf{x}_1), \ldots, I_t(\mathbf{x}_n))^T,$
and $W = \mathrm{diag}(W(\mathbf{x}_1), \ldots, W(\mathbf{x}_n))$ is a diagonal matrix of weighting coefficients used in the weighted least squares solution. For the implementation of the partial differential filters, we consider a separable smoothing filter in space-time, so that the spatial and temporal components can be computed independently. To compute spatial derivatives we use a FIR smoothing filter with a support of 5 × 5 pixels followed by numerical differentiation with a 3-tap FIR filter, both proposed by Simoncelli [11]. The temporal filter was designed according to the work of Fleet and Langley [13] as a cascade of two truncated IIR exponentials. Temporal smoothing and differentiation are both obtained simultaneously with this filter (see [13] for proof). The differentiation is obtained by only two arithmetical operations (one addition and one multiplication by a constant), thus reducing the computational cost. The total temporal delay is on the order of 3 frames, largely reducing storage requirements.
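To make the local least-squares step concrete, the following sketch computes the two velocity components of equation 2 from precomputed derivative images. It is only an illustration, not the pipeline implementation of section 3; the 5 × 5 separable weights and the eigenvalue reliability threshold are assumptions chosen for the example.

```python
import numpy as np
from scipy.ndimage import convolve

def lucas_kanade_velocity(Ix, Iy, It, eig_threshold=0.05):
    """Weighted least-squares flow on every pixel's 5x5 neighbourhood.

    Ix, Iy, It: spatio-temporal derivative images (2-D float arrays).
    The separable 5x5 weights and the eigenvalue threshold below are
    illustrative assumptions, not the pipeline's exact settings.
    """
    w1d = np.array([0.0625, 0.25, 0.375, 0.25, 0.0625])
    W2 = np.outer(w1d, w1d)                       # squared weighting kernel

    def local_sum(img):                           # weighted neighbourhood sum
        return convolve(img, W2, mode='nearest')

    # Elements of A^T W^2 A and A^T W^2 b (cf. equation 3).
    Sxx = local_sum(Ix * Ix)
    Sxy = local_sum(Ix * Iy)
    Syy = local_sum(Iy * Iy)
    Sxt = local_sum(Ix * It)
    Syt = local_sum(Iy * It)

    det = Sxx * Syy - Sxy * Sxy
    den = np.where(det == 0, 1.0, det)            # avoid division by zero
    u = (-Syy * Sxt + Sxy * Syt) / den
    v = ( Sxy * Sxt - Sxx * Syt) / den

    # Reliability test on the smallest eigenvalue of the 2x2 matrix.
    trace = Sxx + Syy
    lam_min = trace / 2.0 - np.sqrt(np.maximum(trace ** 2 / 4.0 - det, 0.0))
    reliable = (det != 0) & (lam_min > eig_threshold)
    return np.where(reliable, u, 0.0), np.where(reliable, v, 0.0)
```

Writing the neighbourhood sums as convolutions with the squared weights is also what allows the same computation to be mapped onto look-up-table and convolution operators in hardware, as described in section 3.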
3
Real-Time Implementation
The real-time implementation was developed for the MaxVideo200 pipeline image processor using VEIL, Virginia's Extensible Imaging Library [14]. The pipeline computing architecture is particularly suited for real-time image processing because it follows a dataflow model of computation with deterministic execution time. The central element of the MaxVideo200 architecture is a 32 × 32 × 8 bit cross-point switch that provides large interconnection flexibility among the available operators. The image processing capacity is concentrated in one arithmetic and logic unit and one processing unit, where local convolutions, morphological operations and feature extraction can be implemented. The global processing structure for optical flow computation is decomposed into the following stages: spatial-temporal filtering, with local operators for space-time convolution, using a recursive filter composed of arithmetic and delay operators; integration in a local neighbourhood, using look-up-tables and convolution operators, which optimise the use of the available hardware; and computation of the weighted least squares solution, by integer addition, multiplication and numerical scaling operations. The global structure that implements the smoothing and differentiation in the space-time domain was integrated in a VEIL operator entitled DxDyDt. It receives as input the image I and outputs the spatial-temporal image gradients $I_x$, $I_y$ and $I_t$. The FIR kernel for spatial convolution is limited to 8 × 8 for a one-pass convolution. Furthermore, only 8-bit integer kernel values are allowed, with a maximum output resolution of 40 bits. Taking into account these limitations we used an integer approximation to the kernel proposed by Simoncelli [11] for image smoothing. The horizontal and vertical differential filters are also
http://www.cs.virginia.edu/~vision/projects
implemented using integer convolution kernels. The output of the convolution operations is conveniently scaled in order to avoid overflow or loss of resolution, as the word length of the output is limited to 16 bits. The recursive IIR filter referred to in the previous section was implemented to perform the temporal smoothing and differentiation. The gradient method with local optimisation is based on the integration of the measurements of the spatial-temporal gradients in a local neighbourhood. This operation consists of the computation of the elements of the matrices $A^T W^2 A$ and $A^T W^2 \mathbf{b}$ in expression 2, resulting in:

$A^T W^2 A = \begin{bmatrix} \sum W^2 I_x^2 & \sum W^2 I_x I_y \\ \sum W^2 I_x I_y & \sum W^2 I_y^2 \end{bmatrix}, \qquad A^T W^2 \mathbf{b} = -\begin{bmatrix} \sum W^2 I_x I_t \\ \sum W^2 I_y I_t \end{bmatrix} \quad (3)$
As stated by Fleet et al. [13], the integration in a local neighbourhood of the products $I_x^2$, $I_y^2$, $I_x I_y$, $I_x I_t$ and $I_y I_t$, weighted by the coefficients $W^2$, can be expressed by a smoothing convolution operation, having a kernel with higher values in the centre. Thus, the integration can be decomposed into two main operations: computation of the products and convolution with a smoothing kernel. For the implementation of the first operation, we used look-up-tables, to avoid an additional burden on the arithmetic unit. The look-up-tables are programmed to compute the products. All the input terms are 8-bit values and the results are represented in 16 bits. The square-value table has a length of 256 elements whereas the other tables have a length of 64 kbytes. All the elements needed for implementing look-up-tables in the pipeline processor are in the VEIL operator Lut. The second operation in the integration is achieved by a 16-bit convolution operation (with the VEIL operator Convolve), using an integer 3 × 3 Gaussian kernel. The computation of the matrix coefficients is illustrated in the diagram of Fig. 1. The two components of velocity result from simple arithmetic implemented as shown in the diagram of Fig. 2. The hardware implementation of the matrix operations in equation 2 involves a normalization by a coefficient set to the maximum of the elements in equation 3;
the normalized coefficients are then used to compute the determinant of the matrix in equation 3.
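In software, the role of the Lut and Convolve stages can be mimicked as in the sketch below; the 8-bit inputs, 16-bit products and integer 3 × 3 Gaussian kernel follow the constraints just described, while the particular kernel values and the right-shift used for numerical scaling are assumptions.

```python
import numpy as np

# Integer 3x3 Gaussian kernel (an assumption consistent with the text,
# which only states that an integer 3x3 Gaussian kernel is used).
GAUSS_3x3 = np.array([[1, 2, 1],
                      [2, 4, 2],
                      [1, 2, 1]], dtype=np.int32)   # sums to 16

def lut_products(ix, iy, it):
    """Emulate the Lut stage: form the five products from 8-bit derivatives.

    ix, iy, it hold signed 8-bit derivative values; the products are kept
    in 16 bits, as in the pipeline.
    """
    ix = ix.astype(np.int16); iy = iy.astype(np.int16); it = it.astype(np.int16)
    return ix * ix, iy * iy, ix * iy, ix * it, iy * it

def integrate(product, shift=4):
    """Emulate the Convolve stage: 3x3 integer Gaussian smoothing.

    The right-shift keeps the result inside 16 bits; the shift amount is an
    assumption standing in for the pipeline's numerical scaling.
    """
    h, w = product.shape
    padded = np.pad(product.astype(np.int32), 1, mode='edge')
    out = np.zeros((h, w), dtype=np.int32)
    for dy in range(3):
        for dx in range(3):
            out += GAUSS_3x3[dy, dx] * padded[dy:dy + h, dx:dx + w]
    return (out >> shift).astype(np.int16)
```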
Fig. 1. Smoothing, differentiation and local neighbourhood integration operations: the Look-Up-Tables implement the 8-bit square and multiplication operations; the local neighbourhood integration is achieved by the Convolve operators configured with Simoncelli's smoothing kernels [11].
The inverse operations were implemented by a look-up-table of integer elements with a length of 64 kbytes. The Maximum operator and the look-up-tables provide numerical scaling in order to obtain the most significant digits of the result in a fixed-point 16-bit resolution. Due to hardware constraints, the single available arithmetic unit operates only on 16-bit integer operands and local memory has to be used for temporary storage. The Merge operators, indicated by the grey blocks on the diagram of Fig. 2, concatenate several data streams side-by-side on the same physical memory of the pipeline processor. This allowed us to form image matrices of size M × kN, where k is the number of input data streams of M rows by N columns. It is then possible to select the parts of this image matrix that will form each output data stream, with sizes of M × N or M × 2N. These streams can also be split at the operators' inputs, as indicated by the left-right split symbols in the graph of Fig. 2. This was needed in order to reduce the number of computation operators in the diagram and to render feasible the task of scheduling the VEIL graph. To compute the determinant, for example, the two concatenated streams are multiplied concurrently by a single Multiply operator; the output stream is then split into its left and right portions, which are subtracted by the Add operator.
4
Results
This implementation was tested on a common set of synthetic and real image sequences. These sequences are widely available to the scientific community and are commonly used for the evaluation of optical flow computation methods and implementations. The "translating tree" and "diverging tree" are sequences of forty 8-bit images with a spatial resolution of 150 × 150 pixels. The velocity
Fig. 2. Diagram of optical flow vector calculation (cf. equation 5): the Merge operators concatenate several data streams and perform intermediate storage in order to reduce the number of operators in the diagram. Data streams are split into their left and right portions before the last row of additions.
ranges from (0, 1.73) pixels/image on the left side to (0, 2.30) pixels/image on the right in the translating case, and from 0 pixels/image at the centre to 1.4 pixels/image on the left and 2.0 pixels/image on the right in the diverging case. The "Yosemite" sequence has fifteen 8-bit images, with a spatial resolution of 252 × 316 pixels. The velocity is known for every pixel, having a maximum value of 5 pixels/image on the bottom left. The real image sequences are known as "Hamburg taxi", "NASA coke can", "Rubic cube" and "SRI trees". Figure 3 illustrates snapshots of the Yosemite and "Coke can" sequences. The vector fields in Fig. 4 represent the computed optical flow for these two sequences. For the comparison of optical flow algorithms we used as error measures the angular error relative to the correct image velocities in the synthetic image sequences, defined by [10] as $\psi_E = \arccos(\hat{\mathbf{v}}_c \cdot \hat{\mathbf{v}}_e)$, where $\hat{\mathbf{v}}_c$ and $\hat{\mathbf{v}}_e$ are the unit velocity vectors of the correct and measured velocity fields, respectively. The mean and standard deviation estimated over all pixels are the parameters used in the evaluation. We also used the image reconstruction root mean square error obtained by linear displacement interpolation, as defined in [15], for both kinds of images. This error is evaluated by the expression:

$\mathrm{RMS} = \sqrt{\frac{1}{M N} \sum_{x,y} \left[ I(x, y, t) - \hat{I}(x, y, t) \right]^2}$
where $I(x, y, t)$ is the image intensity at location $(x, y)$ and time instant $t$, and $\hat{I}(x, y, t)$ is the image reconstructed by the interpolation technique referred to above, derived from the information locally given by the computed optical flow. $M \times N$ is the total
Fig. 3. Snapshots of one synthetic and one real image sequence: a) “Yosemite” and b) “Coke can”.
Fig. 4. Optical flow computed for the synthetic and real image sequences: a) “Yosemite” and b) “Coke can”.
number of pixels in the images. Density is defined as the percentage given by the total number of non-null vectors in the optical flow field divided by the total number of pixels in the image. It can be seen in Table 1 that, when compared to the Barron et al. implementation of the Lucas and Kanade method [12], the angular error is substantially worse in the real-time implementation. This result is mainly due to the smaller filter kernels and the limited integer resolution used in the real-time implementation. The larger kernels of the Barron et al. implementation extend over more image samples, hence providing a stronger regularization of the optical flow. When comparing image reconstruction errors we observe, from Tables 1 and 2, that the real-time implementation performs worse in the pure translation case, but approaches the performance of the Barron et al. implementation of the Lucas and Kanade method [12] in all other cases. This can be justified by the fact that the velocity of translation is very close to the kernel sizes of the real-time implementation. Furthermore, pure translation adheres more closely to the assumption of locally constant velocity and benefits from the larger filter kernels
of the Barron et al. implementation. Densities are also very similar in both implementations. The limited integer resolution of the real-time implementation accounts for the other small differences in performance errors. On the other hand, the processing time for the Yosemite sequence, with a spatial resolution of 252 × 316 pixels, is 47.8 ms per frame for the real-time implementation. The processing time reported by Farnebäck [6] for his method on a Sun Ultra 60 workstation at 360 MHz is 3.5 s, while that reported by Fleury et al. [7] on a 4-processor parallel system is 10.7 s. Therefore, our implementation presents a considerable gain in processing speed with a small loss in accuracy. Standard video frame-rate can be achieved by using a smaller spatial resolution or by using additional arithmetic units on the pipeline image processor.
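For reference, the error measures used in Tables 1 and 2 can be written compactly as in the sketch below. It follows the angular error of [10], using the common convention of embedding each velocity as a 3-D vector before normalising, and the reconstruction RMS error of [15]; it is an illustration of the evaluation procedure, not the code used in these experiments.

```python
import numpy as np

def angular_error(uc, vc, ue, ve):
    """Per-pixel angular error (degrees) between the correct (uc, vc) and
    estimated (ue, ve) flow fields, using the 3-D unit-vector form of [10]."""
    def unit3(u, v):
        n = np.sqrt(u * u + v * v + 1.0)
        return u / n, v / n, 1.0 / n
    xc, yc, zc = unit3(uc, vc)
    xe, ye, ze = unit3(ue, ve)
    dot = np.clip(xc * xe + yc * ye + zc * ze, -1.0, 1.0)
    return np.degrees(np.arccos(dot))

def rms_reconstruction_error(frame, reconstructed):
    """Root mean square error between a frame and the image reconstructed
    from the computed flow, as in [15]."""
    diff = frame.astype(np.float64) - reconstructed.astype(np.float64)
    return np.sqrt(np.mean(diff * diff))

def density(u, v):
    """Percentage of non-null vectors in the flow field."""
    return 100.0 * np.count_nonzero((u != 0) | (v != 0)) / u.size
```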
5
Conclusions and Future Work
We described an implementation of an optical flow algorithm on a pipeline image processor. The method is gradient-based and assumes a constant parametric model of the velocity field with spatial and temporal coherence. The deterministic architecture of this implementation achieves a very low computation delay, exhibiting a performance superior to implementations reported by others, on either general-purpose or parallel hardware. Operation at video frame-rate is also achieved, avoiding the need to store large quantities of image data.
We plan to complement the techniques described here with the segmentation of optical flow in order to determine egomotion and the motion of other objects in the scene. This will allow a more compact and reliable characterization of visual motion to be obtained. The final goal is to provide a real-time, practical tool for the analysis of optical flow in studies of visual motion perception. It may also be applicable to autonomous navigation at the early stages of motion sensing.
References
1. A. T. Smith, R. J. Snowden, Visual Detection of Motion, New York Academic Press, 1994.
2. R. Cummings, Biologically inspired visual motion detection in VLSI, International Journal of Computer Vision 44 (2001) 175–198.
3. Z. Duric, A. Rosenfeld, J. Duncan, The applicability of Green's theorem to computation of rate of approach, International Journal of Computer Vision 31 (1) (1999) 83–98.
4. N. Stofler, T. Burkert, G. Farber, Real-time obstacle avoidance using an MPEG-processor-based optic flow sensor, in: Proceedings of the 15th International Conference on Pattern Recognition, Barcelona, Spain, 2000, pp. 161–166.
5. H. Liu, T.-H. Hong, M. Herman, T. Camus, Accuracy vs efficiency trade-offs in optical flow algorithms, Computer Vision and Image Understanding 72 (3) (1998) 271–286.
6. G. Farnebäck, Fast and accurate motion estimation using orientation tensors and parametric motion models, in: Proceedings 15th Int. Conf. on Pattern Recognition, Barcelona, Spain, 2000, pp. 135–139.
7. M. Fleury, A. F. Clark, A. C. Downton, Evaluating optical-flow algorithms on a parallel machine, Image and Vision Computing 19 (2001) 131–143.
8. M. V. Correia, A. C. Campilho, J. A. Santos, L. B. Nunes, Optical flow techniques applied to the calibration of visual perception experiments, in: Proceedings 13th Int. Conf. on Pattern Recognition, ICPR'96, Vienna, Austria, 1996, pp. 498–502.
9. M. V. Correia, A. C. Campilho, Real-time implementation of an optical flow algorithm, in: Proceedings 16th Int. Conf. on Pattern Recognition, Québec City, Canada, 2002, pp. 247–250.
10. J. L. Barron, D. J. Fleet, S. S. Beauchemin, Performance of optical flow techniques, International Journal of Computer Vision 12 (1) (1994) 43–77.
11. E. P. Simoncelli, Distributed representation and analysis of visual motion, Ph.D. thesis, Massachusetts Institute of Technology (January 1993).
12. B. D. Lucas, T. Kanade, An iterative image registration technique with an application to stereo vision, in: Proceedings of the 7th International Joint Conference on Artificial Intelligence, Vancouver, Canada, 1981, pp. 674–679.
13. D. J. Fleet, K. Langley, Recursive filters for optical flow, IEEE Transactions on Pattern Analysis and Machine Intelligence 17 (1) (1995) 61–67.
14. T. J. Olson, J. R. Taylor, R. J. Lockwood, Programming a pipelined image processor, Computer Vision and Image Understanding 64 (3) (1996) 351–367.
15. T. Lin, J. L. Barron, Image reconstruction error for optical flow, in: Proceedings of Vision Interface'94, Banff National Park, Canada, 1994, pp. 73–80.
De-interlacing Algorithm Based on Motion Objects* Junxia Gu, Xinbo Gao, and Jie Li School of Electronic Engineering, Xidian Univ., Xi’an 710071, P.R.China
Abstract. A novel de-interlacing algorithm based on motion objects is presented in this paper. In this algorithm, natural motion objects, not contrived blocks, are used as the processing cells; they are accurately detected by a new scheme, and their matching objects are quickly found by the immune clonal selection algorithm. This novel algorithm integrates many other de-interlacing methods, so it is more adaptive to various complex video sequences. Moreover, it can perform motion compensation for objects undergoing translation and rotation as well as scaling. The experimental results illustrate that, compared with the block-matching method with full search, the proposed algorithm greatly improves both efficiency and performance.
1
Introduction
Nowadays, the interlaced scanning technique is widely adopted in television systems worldwide. A major drawback of such a scanning scheme is line flicker and the jagged appearance of moving edges. Furthermore, an interlaced field is unsuitable for media that use a progressively scanned format, such as video printing, computers and new technologies like plasma and LCD displays [1]. Thus, many de-interlacing algorithms have been proposed to reduce these artifacts. In general, the existing algorithms can be classified into intra-field and inter-field techniques [2]. Intra-field algorithms exploit the high correlation between the pixels in the current field and those to be interpolated and do not demand a high hardware cost. However, the vertical resolution is halved and the image is blurred because the high frequency components are suppressed. Inter-field algorithms, in contrast, require at least one field memory. This increases the hardware cost, but the increased freedom improves the de-interlacing performance. Simple inter-field methods, like Weave, behave very well for static scenes and restore the spatial details, but cause artifacts like "ghosting" in the case of motion. Many techniques, such as median filtering (MF) [2], motion detection [1,3] and so on, have been proposed to make the interpolation adaptive to motion. Motion compensation is one of the advanced methods. Motion compensated (MC) de-interlacing methods, which have the strongest physical motivation, attempt to interpolate in the direction with the highest
This work was supported by the NSF of China under grant No.60202004.
correlation [2]. In MC algorithms, the block-matching (BM) method, in which all pixels of a block are assumed to undergo the same translation and are assigned the same correspondence vector, has been widely adopted for motion estimation [2,4,5]. The BM algorithm behaves well in video sequences with horizontal or vertical motion, but the inherent drawbacks of this assumption restrict further improvement of the performance. Firstly, blocks are artificial partitions, so they do not follow the natural edges of objects. Even in the interior of an object, mixed blocks, in which the pixel motion is not the same, also exist. Secondly, the BM method is only adaptive to horizontal or vertical motion and is unsuitable for rotation and scaling transforms. In addition, only luminance can be used in the matching criteria, which affects the precision of motion estimation. To overcome these drawbacks, a novel de-interlacing algorithm based on motion objects (DA-MO) is presented, in which the processing cell is not a factitious block but a natural motion object (MO). A motion object refers to a group of adjacent pixels that have the same motion vector within a motion region. In other words, a motion region (MR) may consist of several motion objects. When operating on motion objects, the mixed-block problem can be effectively avoided, and rotating and scaling objects can also be dealt with. At the same time, many object characteristics, such as luminance, area, perimeter, Euler number and form factor, can be adopted to construct the matching criteria and improve the precision of object matching.
2
De-interlacing Algorithm Based on Motion Objects
The proposed de-interlacing algorithm follows four steps: motion region detection (MRD), motion object extraction (MOE), motion vector estimation (ME) and motion compensation (MC). The block diagram of this algorithm is shown in Fig.1. The details of each step will be given respectively.
Fig. 1. Block diagram of proposed de-interlacing method
2.1
Motion Region Detection
In this step, the inter-frame changes produced by motion, which comprise both motion regions and uncovered background, are first detected with a higher-order statistics model [6].
For an interlaced video sequence, same-parity fields are needed to accurately detect motion regions [3]. Let the detection signal be the absolute difference between two adjacent same-parity fields; it contains the motion regions in the current field and the uncovered background of the current field relative to the reference field, and the background noise is modeled as a Gaussian random field. Following [6], the motion detector operates on a moving window W centred at each spatial position: the sample mean of the absolute difference over the pixels in W is compared with a threshold derived from the noise variance, scaled by a constant that is approximately independent of the sequence characteristics. In motion detection, "over-detection" is employed to detect as much motion as possible. Post-processing with morphological filtering, such as opening and closing, is then performed on the motion marker matrix to remove false alarms. Since the obtained motion marker picture consists of motion regions and uncovered background, the motion regions should be distinguished from the uncovered background to avoid unnecessary motion estimation. In the preliminary motion detection between the current field and different reference fields, the detected motion regions of the current field will, theoretically speaking, appear at the same location in every detection, while the uncovered background will appear at a different location in almost every detection. The intersection of all motion detection results for the current field will therefore, in theory, accurately give the motion regions. Fig. 2 shows the experimental result for the fifth and sixth fields of the "golf" sequence.
Fig. 2. Motion region detection: (a) and (b) are the two fields, (c) and (d) are the MRs in each field, and (e) is the uncovered background
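A minimal software sketch of this detection stage is given below. The exact higher-order-statistics detector of [6] is not reproduced here, so the windowed-mean test, the scale factor k and the 5 × 5 window are assumptions standing in for it; the morphological post-processing and the intersection over reference fields follow the description above.

```python
import numpy as np
from scipy.ndimage import uniform_filter, binary_opening, binary_closing

def detect_motion(field_a, field_b, noise_var, k=3.0, win=5):
    """Stand-in for the higher-order-statistics detector of [6]: the windowed
    mean of the absolute same-parity field difference is compared against
    k times the noise standard deviation (k and the window size are assumed)."""
    d = np.abs(field_a.astype(np.float64) - field_b.astype(np.float64))
    marker = uniform_filter(d, size=win) > k * np.sqrt(noise_var)
    # Morphological opening and closing remove false alarms ("over-detection").
    return binary_closing(binary_opening(marker))

def motion_regions(current, references, noise_var):
    """Intersect detections against several reference fields: true motion
    regions repeat at the same place, uncovered background does not."""
    masks = [detect_motion(current, ref, noise_var) for ref in references]
    out = masks[0].copy()
    for m in masks[1:]:
        out &= m
    return out
```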
2.2
Motion Object Extraction
Motion vectors cannot be obtained before motion estimation, so it is impossible to determine in advance whether adjacent motion pixels undergo the same motion. We assume that a motion object is spatially homogeneous. A spatial segmentation method can then be adopted, and motion objects can afterwards be obtained by fusing the spatial segmentation with the motion region detection.
The watershed transform is one of the widely used methods for spatial segmentation of video. However, it tends to over-segment. Here, the watershed transform is used for preliminary segmentation, and each segmented object is declared a motion object if the percentage of motion pixels in the object is greater than a threshold. Some post-processing operations are then performed on the detected motion objects: small objects are either merged, as over-segmented parts, or eliminated as noise.
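Given a precomputed watershed label image, the fusion with the motion regions can be sketched as follows; the ratio and minimum-area thresholds are assumptions, since their values are not specified here.

```python
import numpy as np

def motion_objects(labels, motion_mask, ratio=0.5, min_area=20):
    """Declare each watershed segment a motion object when the fraction of
    motion pixels inside it exceeds `ratio`; tiny segments are dropped.

    `labels` is the (possibly over-segmented) watershed label image and
    `motion_mask` the detected motion regions; the threshold values are
    assumptions, since the paper does not give them.
    """
    objects = []
    for lab in np.unique(labels):
        if lab == 0:                      # background / watershed lines
            continue
        region = labels == lab
        if int(region.sum()) < min_area:
            continue                      # treated as noise or merged elsewhere
        if motion_mask[region].mean() > ratio:
            objects.append(lab)
    return objects
```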
2.3
Motion Estimation
In the motion estimation module, the key step is motion object matching. If there are a large number of motion objects, it is necessary to select a candidate matching object set (CMOS) for each motion object in the current field, based on some basic object characteristics in the reference field. Here, the following evaluation function is adopted for the selection of the CMOS.
The evaluation combines the relative mean luminance difference and the relative area difference between the object in the current field and an object in the reference field, balanced by a weighting factor; an evaluation value below a threshold means that the object in the reference field is selected into the CMOS of the object in the current field. Unfortunately, there are some drawbacks in the process of object segmentation, so one has to search for the optimal matching object around each object in the reference field. For every object in the current field, the search for a matching object follows four steps: (1) select an object in the CMOS as the centre of the search region; (2) define a rectangular search region based on the size of the selected object; (3) considering the rotation factor and the scaling factor, search every point in the search region to obtain the optimal solution relative to the selected object; (4) if all of the candidate objects in the CMOS have been matched, output the optimal search result, otherwise return to (1). If the optimal matching error is smaller than a threshold, it is declared that the object in the current field has a matching object in the reference field. In the above search steps, luminance is still employed to compute the matching error, and the mean absolute difference (MAD) is used as the matching criterion. The global optimal solution could be found by full search, but such a large search space is time-consuming. Thereby, the immune clonal selection (ICS) algorithm is adopted to accelerate the search process. The key problems in employing the ICS algorithm to search for a matching object are to design a suitable affinity function and to encode the solution as an antibody. Here, the solution string encodes, in binary form, the displacement between the centres of the two objects, the rotation factor and the scaling factor. The affinity function of object matching is defined as follows.
with A being an object in the current field and the second argument a possible matching object obtained by transforming the candidate object B, taken from the corresponding CMOS of A, with the encoded displacement, rotation and scaling. The affinity combines the matching error between A and the transformed candidate with a motion factor, which is determined by the number of motion pixels in the object.

2.4
Motion Compensation
The last step is the de-interlacing operation based on motion compensation. Different methods are adopted for different cases: for still background, Weave is the best choice; for objects with a matching object, motion compensation based on temporal median filtering [2] is employed; and for uncovered background and objects without a matching object, Bob is adopted.
3
Experimental Results and Analysis
This section presents experimental results with the proposed de-interlacing algorithm, in which four video sequences are selected as the test bed. The first one is the "man" sequence with horizontal motion (http://www.cim.mcgill.ca/mlamarre/particle_filter.html). The second one is the "tennis1" sequence with vertical motion (http://sampl.eng.ohio-state.edu/sampl/database.htm). The third one is the "golf" sequence with complex motion objects and slight background variation. The last is the "tennis2" sequence with complex motion objects (http://sampl.eng.ohio-state.edu/sampl/database.htm). Fig. 3 shows one frame of each sequence.
Fig. 3. Test sequences
Here, an objective measurement against the progressive sequence is introduced [3]. Two criteria, peak signal-to-noise ratio (PSNR) and significant error ratio (SER), are selected as evaluation functions of de-interlacing performance. PSNR reflects the overall performance, while SER captures the errors that are significant to human vision. SER is given by the percentage of elements of images A and B whose absolute difference is greater than a given threshold:

$\mathrm{SER} = \frac{1}{M N} \sum_{i,j} T\big(|A(i,j) - B(i,j)| > T_h\big) \times 100\%$
where $T(\cdot)$ is a logic function that equals 1 when its argument is true and 0 otherwise, $T_h$ is the threshold, and $M \times N$ is the size of A and B.
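A direct implementation of these two measures might look like the following sketch; the 8-bit peak value and the default threshold are assumptions.

```python
import numpy as np

def psnr(a, b):
    """Peak signal-to-noise ratio between the de-interlaced frame and the
    progressive reference, assuming 8-bit images."""
    mse = np.mean((a.astype(np.float64) - b.astype(np.float64)) ** 2)
    return float('inf') if mse == 0 else 10.0 * np.log10(255.0 ** 2 / mse)

def ser(a, b, threshold=10):
    """Significant error ratio: percentage of pixels whose absolute
    difference exceeds `threshold` (the threshold value is an assumption)."""
    diff = np.abs(a.astype(np.int32) - b.astype(np.int32))
    return 100.0 * np.count_nonzero(diff > threshold) / diff.size
```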
In the experiments, the value of every threshold is fixed in advance. Fig. 4 shows the curves of the mean PSNR over the whole image for the four sequences, obtained with the different de-interlacing algorithms, and Fig. 5 shows the curves of the mean PSNR for the motion regions only. The corresponding statistics are shown in Table 1, in which the mean CPU time is normalized to the time of the block-matching method with full search (BM-FS).
Fig. 4. Curves of the PSNR of the whole image
Some conclusions can be reached from Fig. 4, Fig. 5 and Table 1. (1) For "man", the algorithms adaptive to motion are better than Weave and Bob. The mean PSNR of DA-MO is higher than those of BM-FS and MF. (2) For "tennis1", DA-MO is much better than BM-FS and MF. In this case, Weave is also a good choice because of the small motion region, but the algorithms adaptive to motion are better than Weave in the motion regions. (3) For "golf", there is a slight variation in the background, so MF is the best method overall. Weave is good for the fields from the 15th to the 20th, which have small motion regions, while DA-MO is better than the other methods in the motion regions. The reason the PSNR varies almost periodically with the field number is that in the original progressive sequences the content of the two marginal successive lines is not homogeneous, and the line repetition method is always used for these lines. When these marginal lines are ignored, the periodicity disappears. (4) For "tennis2", with many distorted objects, the overall performance of DA-MO and Weave is better, and in the motion regions DA-MO is better than the other methods. In addition, it is found from Table 1 that the SER of DA-MO is lower than that of the other algorithms. Since a motion object always includes some motion blocks, the
Fig. 5. Curves of PSNR of the motion regions
computing time of DA-MO is always smaller than that of BM-FS. In short, no single de-interlacing method is suitable for all kinds of complex video sequences, so the DA-MO algorithm, which combines motion compensation, MF, Weave and Bob, is a better choice for all kinds of video sequences.
4
Conclusions
In de-interlacing algorithms based on MC, the inherent drawbacks of block matching limit further improvement of the performance, so a novel algorithm based on motion objects is presented. An accurate method is introduced to detect the motion objects, with the ICS algorithm accelerating the search for matching objects. The proposed algorithm integrates many other de-interlacing methods, such as MC, MF, Weave and Bob, so it is more adaptive to various video sequences. In addition, it can deal not only with translation, but also with rotation and scaling transforms. Of course, the novel algorithm still has some open problems. For example, the object segmentation is not accurate enough, so many object characteristics cannot be used to improve the matching performance. To further enhance the performance of the proposed algorithm, some in-depth study is needed in future work.
References
1. D. Van De Ville, B. Rogge, W. Philips, and I. Lemahieu. Deinterlacing Using Fuzzy-based Motion Detection. Proc. of the Third International Conference on Knowledge-Based Intelligent Information Engineering Systems, 1999, pp. 263-267.
2. E.B. Bellers and G. de Haan. Advanced De-interlacing Techniques. Proc. ProRISC/IEEE Workshop on Circuits, Systems and Signal Processing, Mierlo, The Netherlands, November 1996, pp. 7-17.
3. Shyh-Feng Lin, Yu-Ling Chang and Liang-Gee Chen. Motion Adaptive Interpolation with Horizontal Motion Detection for Deinterlacing. IEEE Trans. Consumer Electronics. 2003, 49(4): 1256-1265.
4. http://www.newmediarepublic.com/dvideo/compression.2003.
5. Yao Nie and Kai-Kuang Ma. Adaptive rood pattern search for fast block-matching motion estimation. IEEE Trans. Image Proc. 2002, 11(12): 1442-1449.
6. A. Neri, S. Colonnese, G. Russo and P. Talone. Automatic Moving Object and Background Separation. Signal Processing. 1998, 66(2): 219-232.
7. Du Hai-feng. Immune Clonal Computing and Artificial Immune Networks. Postdoctoral Research Work Report of Xidian Univ. 2003. (in Chinese)
Automatic Selection of Training Samples for Multitemporal Image Classification T.B. Cazes, R.Q. Feitosa, and G.L.A. Mota
Catholic University of Rio de Janeiro – Department of Electrical Engineering State University of Rio de Janeiro – Department of Computer Engineering {tcazes,raul,guimota}@ele.puc-rio.br
Abstract. The present work presents and evaluates a method to automatically select training samples of medium resolution satellite images within a supervised object-oriented classification procedure. The method first takes a pair of images of the same area acquired on different dates and segments them into regions that are homogeneous in both images. Then a change detection algorithm takes the stable segments as training samples. In experiments using Landsat images of an area in Southwest Brazil taken in three consecutive years, the performance of the proposed method was close to the performance associated with the manual selection of training samples.
1 Introduction Remote sensing is one of the most important technologies available for monitoring large areas of the world. Despite the remarkable advances of the last decades, the interpretation of remotely sensed images is still mainly a visual procedure. There is a great worldwide effort in the search for methods that increase the productivity of photo-interpreters by emulating part of their reasoning in a computer system [1][2][3][4][5][6][7]. The supervised interpretation process can be roughly described by the following sequential steps [8][9]: 1) image segmentation; 2) selection of training samples; 3) supervised classification; 4) post-editing of the supervised classification result. Steps 1 and 3 are usually performed automatically after the user defines the values of some operating parameters. Steps 2 and 4 are, on the contrary, essentially manual. Thus any attempt to increase the automation level of the image interpretation process must focus on steps 2 and 4. The present paper proposes a method to automate the selection of training samples (step 2). The proposed approach takes as inputs two images of the same area acquired on different dates – a previous image and a current image – and a reliable classification of the area at the instant t-1. The automatically selected samples will be the segments whose classification did not change between t and t-1. As a
matter of fact, this paper extends the procedure originally proposed in [10] for a pixel-wise classification to an object-oriented classification approach. Moreover, in comparison to the results presented in [10], this paper presents a more thorough performance analysis, since it is based on more reliable data. The remainder of this paper is organized as follows. The next section presents the proposed automatic training sample selection procedure. Section 3 presents the experiments carried out to evaluate the proposal. Finally, section 4 presents the main conclusions.
2 The Automatic Selection of Training Samples The spectral appearances of the land use/land cover (LULC) classes are affected by many factors such as atmospheric conditions, sensor calibration and land humidity [11]. The supervised classification procedure captures all these factors together by estimating the classifier parameters upon training samples collected from the very image to be classified. This work uses an object-oriented approach [12][13][14][15][16][17] whereby the objects to be classified are segments instead of pixels, and uses the average multispectral responses as the only object attribute. In comparison to a pixel-wise classification, this proposal is expected to be less sensitive to registration inaccuracies. The image segmentation procedure employed here takes both images as input, so that each segment is homogeneous in the current as well as in the previous image. All segments obtained in step 1 of the interpretation process are candidates for the training set. Fig. 1 summarizes the procedure of automatic selection of training samples. An automatic change detection algorithm identifies which segments do not change from one class to another between the previous and the current image. Many change detection algorithms have been proposed in the literature (a survey can be found in [18]). Most of them are based on the following assumptions. First, the changes in the LULC between t and t-1 are moderate. Second, natural events and varying image acquisition conditions may affect each class differently, but have nearly the same effect on all segments of a single class. Finally, the images are assumed to be registered in relation to each other. This work uses the change detection approach described in [10]. Similar methods can be found in [8][19][20][21]. It is assumed that the spectral values of stable segments in the current and in the previous images can be well described by a linear relation. Thus, if the spectral values of a single segment at t and t-1 fit the linear model, the segment is considered stable. On the other hand, if the spectral responses do not fit the linear model, it is considered changed. The stable segments make up the training set.
Fig. 1. The overview of the automatic selection of training samples.
More specifically, each class is assumed to follow a regression model, given by

$y_i = \sum_{j=1}^{p} \beta_j x_{ij} + \varepsilon_i \quad (1)$

where $x_{ij}$ is the mean intensity of segment $i$ in band $j$ of the previous image, $y_i$ is the corresponding mean intensity of the segment in one band of the current image, $\beta_j$ is the regression coefficient for band $j$, $\varepsilon_i$ is the error, and $p$ is the number of bands. According to linear regression theory [5] the least squares estimate of $\beta$ is given by

$\hat{\beta} = (X^T X)^{-1} X^T y \quad (2)$
where X is the sample matrix whose n rows are the mean spectral responses of the segments in the previous image and the vector y contains their respective mean spectral intensities in one band of the current image. Evidence of whether the model fits a segment well or not (with significance level $\alpha$) can be drawn by computing the confidence interval for the mean of each error, given by

$e_i \pm t_{\alpha/2,\, n-r-1}\; s \sqrt{1 - h_{ii}} \quad (3)$

and the matrix H is given by

$H = X (X^T X)^{-1} X^T \quad (4)$
where $e_i$ is the raw residual for segment i, $t_{\alpha/2,\, n-r-1}$ is the inverse Student's t cumulative distribution function with n-r-1 degrees of freedom at $\alpha/2$, $s$ is the estimated standard deviation of the error, and $h_{ii}$ is the ith diagonal element of the matrix H. If the confidence interval given by equation (3) does not include 0 (zero), the corresponding segment is likely to have changed.
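The test can be sketched as follows. The interval form (raw residual, Student's t quantile, leverage term) follows standard regression theory as reconstructed above, so the code should be read as an illustration of the described procedure rather than the authors' exact implementation.

```python
import numpy as np
from scipy import stats

def stable_segments(X, y, alpha=0.10):
    """Flag segments whose spectral response fits the per-class linear model.

    X : n x p matrix of mean spectral responses of the segments in the
        previous image (one row per segment, one column per band).
    y : length-n vector of mean intensities in one band of the current image.
    alpha : significance level (10%, 30% and 50% are tried in the paper).

    Returns a boolean mask that is True for segments considered stable.
    """
    n, r = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)        # (X^T X)^-1 X^T y
    residuals = y - X @ beta
    H = X @ np.linalg.inv(X.T @ X) @ X.T                # hat matrix
    s = np.sqrt(residuals @ residuals / (n - r - 1))    # error std. deviation
    t = stats.t.ppf(1.0 - alpha / 2.0, n - r - 1)
    half_width = t * s * np.sqrt(np.clip(1.0 - np.diag(H), 0.0, None))
    # Stable if the confidence interval around the residual contains zero.
    return np.abs(residuals) <= half_width
```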
3 Experiments This section describes the experiments performed to evaluate the proposed approach.
3.1 Reference Data The experiments used images from the Taquari Watershed, County of Alcinópolis in the State of South Mato Grosso, Brazil. Such images – composites of the bands 5, 4 and 3 of the sensor Landsat TM in the channels R, G and B respectively – are part of the scene 224-073 of the satellite LANDSAT. The images were acquired on August 5, 1999; August 7, 2000; and August 10, 2001.
The classification used as reference for this evaluation was produced visually by an experienced photo-interpreter aided by additional data: the previously mentioned images, the drainage map, the digital elevation model and a videography of the selected area performed in 2001. A variation of the watershed segmentation
algorithm [22] was used in the experiments. Only segments covered by the videography were used in the evaluation. The legend consists of the following classes: bare soil, riparian forest, pasture, water bodies, dense savanna and dense savanna in regeneration. The number of segments of each class, the number of changed segments, and the percentage of changed segments for each class in the reference classification are presented in Table 1.
3.2 Experiments Design The outcome of a maximum likelihood classifier was used as a performance measure, so the misclassification rates obtained when the training samples are a) selected automatically by the proposed method and b) selected manually by a photo-interpreter are compared. For each class the same number of samples was taken for a) and b), whereby the samples associated with the best classification performance were used in case b). The performance of most change detection algorithms depends on the percentage of change between the two input images. In order to assess the effect of this factor, the numbers of changed and stable segments used in the evaluation were established so that the amount of change was nearly the same for all classes. Experiments were performed for 1%, 5% and 10% change. The stable and changed segments used as input in the experiments were selected randomly among the available data. In the cases where there was no change for a class, all available (stable) segments were taken into the training set. This occurs in the data set for the classes riparian forest, water bodies, dense savanna and dense savanna in regeneration. As input data for the experiments, different pairs of previous-current images – specifically 1999-2000, 2000-2001 and 1999-2001 – were used. For each pair, the experimental results reported in the next section are the mean values of 100 trials. In each trial the changed and stable segments were selected randomly.
3.3 Results and Analysis Fig. 2 summarizes the results of the previously described experiments. Each bar in that figure presents the percentage of misclassification. The gray bars refer to the automatically selected training set while the dark gray bars refer to the manually selected training set. A pair of consecutive bars (automatic-manual) is associated with a given set of input parameters – image pair, confidence level, and amount of change for all classes.
Another aspect investigated in the experiments was the sensitivity of the classification performance to the confidence level used in the change detection algorithm. The values 10%, 30% and 50% were used in these experiments.
Fig. 2. Experimental results
Analyzing Fig. 2, it can be observed that, in general, the classification errors of both methods are similar; in most cases the difference is below 2%. Considering the influence of the amount of change in the training set, the automatic selection tends to perform better than the manual selection for 1% of change, while the manual selection performs better for 10% of change. This is not unexpected, since the automatic method assumes a low amount of change. With respect to the confidence level, it can be observed that its influence on the performance decreases as the amount of change increases. For 1% of change the influence of this parameter is bigger than for 5% and 10% of change. It is worth mentioning that for 1% of change the automatically selected training set performed better than
the manually selected training set. This fact reinforces the assumption that the amount of change must be small.
4. Conclusions The present paper proposed and evaluated a method to automatically select training samples in the context of the supervised classification of remotely sensed data. The method takes as input two images of the same area obtained on different dates and a reliable classification of one of them. A change detection algorithm is applied to both images and stable segments are taken as training patterns. Experiments using images of the Taquari Watershed region in Southwest Brazil taken in three consecutive years indicated that the proposed automatic method has a performance close to that of manual training set selection. The performance of a maximum likelihood supervised classifier was used to compare the automatic and the manual methods; in most cases the difference in classification error is smaller than 2%. The experiments also showed that the method is quite robust to variation of its parameters. This method can be used in a semi-automatic fashion by highlighting "good" candidates for the training set, which would finally be selected by the operator. Variations are also conceivable where each object would additionally show the level of certainty that it remains in the same class as in the classification of the previous date. These are possible uses of the proposed method, which could be a valuable aid to the photo-interpreter in the task of selecting training samples. Acknowledgement. This work was supported by CAPES and DAAD within the PROBAL program.
References
1. MCKEOWN, D. M., HARVEY, W. A., MCDERMOTT, J. Rule based interpretation of aerial imagery. IEEE Transactions on Pattern Analysis and Machine Intelligence (1985), v. 7, n. 5, p. 570-585
2. MATSUYAMA, T., HWANG, V. SIGMA, a knowledge-based aerial image understanding system. Advances in computer vision and machine intelligence, New York: Plenum (1990)
3. CLÉMENT, V., GIRAUDON, G., HOUZELLE, S., SANDAKLY, F. Interpretation of Remotely Sensed Images in a Context of Multisensor Fusion Using a Multispecialist Architecture. IEEE Transactions on Geoscience and Remote Sensing (1993) Vol. 31, No. 4
4. NIEMANN, H., SAGERER, G., SCHRÖDER, S., KUMMERT, F. ERNEST: A Semantic Network System for Pattern Understanding. IEEE Transactions PAMI (1990) Vol. 12, No. 9
5. LIEDTKE, C.-E. AIDA: A System for the Knowledge Based Interpretation of Remote Sensing Data. Proceedings of the Third International Airborne Remote Sensing Conference and Exhibition, Copenhagen, Denmark (1997)
6. BÜCKNER, J., STAHLHUT, O., PAHL, M., LIEDTKE, C.-E. GEOAIDA – A Knowledge Based Automatic Image Data Analyzer for Remote Sensing Data. Proceedings of the ICSC Congress on Computational Intelligence Methods and Applications 2001 – CIMA 2001, Bangor, Wales, United Kingdom (2001)
7. LIEDTKE, C.-E., BÜCKNER, J., PAHL, M., STAHLHUT, O. Knowledge Based System for the Interpretation of Complex Scenes. Ascona, Switzerland (2001)
8. MATHER, P.M. Computer Processing of Remotely-Sensed Images. An Introduction. Ed. Wiley, Second edition (1999)
9. RICHARDS, J.A., JIA, X. Remote Sensing Digital Image Analysis – An Introduction, 3rd Ed. Springer Verlag (1999)
10. FEITOSA, R.Q., MEIRELLES, M.S.P., BLOIS, P.A. Using Linear Regression for Automation of Supervised Classification in Multitemporal Images. Proceedings of the First International Workshop on Analysis of Multi-temporal Remote Sensing Images – MultiTemp 2001, Trento, Italy (2001)
11. CHAVEZ, P. S. Jr. Image-Based Atmospheric Corrections – Revisited and Improved. PE&RS (1996) pp. 1025-1036
12. BENZ, U.C., HOFMANN, P., WILLHAUCK, G., LINGENFELDER, I., HEYNEN, M. Multi-resolution, object-oriented fuzzy analysis of remote sensing data for GIS-ready information. ISPRS Journal of Photogrammetry & Remote Sensing 58 (2004) 239-258
13. CORR, D. G., BENZ, U., LINGENFELDER, I., WALKER, A., RODRIGUEZ, A. Classification of urban SAR imagery using object oriented techniques. Proceedings of IGARSS 2003 IEEE, Toulouse (2003), Session: Information Extraction from High Resolution SAR Data
14. DARWISH, A., LEUKERT, K., REINHARDT, W. Image Segmentation for the Purpose of Object-Based Classification. Proceedings of IGARSS 2003 IEEE, Toulouse (2003)
15. ANDRADE, A., BOTELHO, M. F., CENTENO, J. Classificação de imagens de alta resolução integrando variáveis espectrais e forma utilizando redes neurais artificiais. XI Seminário Brasileiro de Sensoriamento Remoto, Belo Horizonte, Brazil (2003) pp. 265-272
16. DARVISH, A., LEUKERT, K., REINHARDT, W. Image Segmentation for the Purpose of Object-Based Classification. International Geoscience and Remote Sensing Symposium, Toulouse, France (2003)
17. YAN, G. Pixel based and object oriented image analysis for coal fire research. Master Thesis, ITC, Netherlands (2003)
18. COPPIN, P., LAMBIN, E., JONCKHEERE, I., MUYS, B. Digital change detection methods in natural ecosystem monitoring: A review. Analysis of Multi-temporal Remote Sensing Images, Proceedings of MultiTemp 2001, Trento, Italy (2001)
19. FUNG, T., LEDREW, E. Application of principal components analysis to change detection. Photogramm. Eng. Remote Sensing 53 (1987) pp. 1649-1658
20. BANNER, A., LYNHAM, T. Multi-temporal analysis of Landsat data for forest cutover mapping – a trial of two procedures. Proc. 7th Can. Symp. on Remote Sensing, Winnipeg, Canada (1981) pp. 233-240
21. MALILA, W.A. Change vector analysis: an approach to detecting forest changes with Landsat. Proc. 6th Int. Symp. on Machine Processing of Remote Sensing of Environment, Ann Arbor, Michigan (1987) pp. 797-804
22. GONZALES, R.C., WOODS, R.E. Digital Image Processing. Reading, MA: Addison-Wesley (1992)
Parallel Computation of Optical Flow
Antonio G. Dopico, Miguel V. Correia, Jorge A. Santos, and Luis M. Nunes
Fac. de Informatica, U. Politecnica de Madrid, Madrid. [email protected]
Inst. Engenharia Biomédica, Univ. Porto, Fac. de Engenharia. [email protected]
Inst. de Educação e Psicologia, Univ. Minho, Braga. [email protected]
Dirección General de Tráfico, Madrid. [email protected]
Abstract. This paper describes a new parallel algorithm to compute the optical flow of a video sequence. A previously sequential algorithm has been distributed over a cluster. It has been implemented on a cluster with 8 nodes connected by Gigabit Ethernet. On this architecture, the algorithm, which computes the optical flow of every image in the sequence, is able to process 10 images of 720 × 576 pixels per second. Keywords: Optical Flow, Distributed Computing
1
Introduction
There is a wide variety of areas of interest and application fields (visual perception studies, scene interpretation, motion detection, in-vehicle intelligent systems, etc.) that can benefit from optical flow computation. The concept of optical flow derives from a visual system concept analogous to the human retina, in which the 3D world is represented on a 2D surface by means of an optical projection. In the present case, we use a simplified 2D representation consisting of a matrix of pixels in which only the grey levels of luminance are considered. Spatial motion and velocity are then represented as a 2D vector field showing the distribution of velocities of apparent motion of the brightness pattern of a dynamic image. The optical flow computation of a moving sequence is a demanding application in both memory and computational terms. As computer performance improves, user expectations rise too: higher-resolution video recording systems allow the negative effects of spatial and temporal motion aliasing to be reduced. In [1], synthetic images with 1312 × 2000 pixels at 120 Hz are used. Given the growing need for computing power, the parallelization of the optical flow computation appears as an interesting alternative to achieve massive processing of long video sequences. This idea of parallelization, proposed some years ago with four processors [2], obtained very modest results: processing up to 7-8 images of 64 × 64 pixels per second, a resolution too small to be useful. More recently, [3] proposes the decomposition of the optical flow computation into small tasks: by dividing the image into independent parts, the parallelization becomes easier to approach, although with the drawback of the overheads associated with dividing the images and grouping the obtained results. As this has not yet been implemented, no results are available.
A possible alternative to parallelization could be to drastically simplify the optical flow algorithm. Such an alternative is presented in [4], based on additions and subtractions, which needs much less computational resources but, according to the authors, at the cost of incorrect results. In the present work the parallelization of optical flow computation is approached with the objective of maximizing performance with no loss in the quality of the results, allowing long video sequences at standard resolutions to be computed massively. The gain due to parallelization will be measured with respect to a sequential version of an equivalent algorithm.
2
Optical Flow Sequential Computation Algorithm
Following the survey of Barron et al. [5], the method of Lucas and Kanade [6] has been chosen to compute optical flow. This method provides the best estimates at a low computational cost, which is also confirmed by the work of Galvin et al. [7]. This method was also used in previous work by the authors [8,9].
2.1
Lucas Optical Flow Algorithm
In Lucas and Kanade's method, optical flow is computed by a gradient-based approach, following the common assumption that image brightness remains constant between time frames. This can be expressed by the motion constraint equation, in compact form:

$\nabla I \cdot \mathbf{v} + I_t = O(\partial^2)$

where $\nabla I$ and $I_t$ represent the spatial gradient and temporal derivative of image brightness, respectively, and $O(\partial^2)$ indicates second order and above terms of the Taylor series expansion. In this method, the image sequence is first convolved with a spatio-temporal Gaussian to smooth noise and very high contrasts that could lead to poor estimates of image derivatives. Then, according to the implementation of Barron et al., the spatio-temporal derivatives $I_x$, $I_y$ and $I_t$ are computed with a four-point central difference. Finally, a weighted least-squares fit of local first-order constraints, assuming a constant model for v in each small spatial neighborhood, provides the two components of velocity, which are obtained by:
$\mathbf{v} = (A^T W^2 A)^{-1} A^T W^2 \mathbf{b}$

where, for $n$ points $\mathbf{x}_i$ at each instant $t$, $A = [\nabla I(\mathbf{x}_1), \ldots, \nabla I(\mathbf{x}_n)]^T$, $W = \mathrm{diag}(W(\mathbf{x}_1), \ldots, W(\mathbf{x}_n))$ and $\mathbf{b} = -(I_t(\mathbf{x}_1), \ldots, I_t(\mathbf{x}_n))^T$.
Simoncelli [10] presents a Bayesian perspective of the least squares solution. He models the gradient constraint using Gaussian distributions. This modification allows unreliable estimates to be identified using the eigenvalues of $A^T W^2 A$.
2.2
Implementation
The implementation first smoothes the image sequence with a spatio-temporal Gaussian filter to attenuate temporal and spatial noise, as do Barron et al. [5]: a temporal smoothing Gaussian filter, requiring a window of frames made up of the current frame together with past and future frames; and a spatial smoothing Gaussian filter, requiring the central pixel and an equal number of pixels on each side of it. This symmetric one-dimensional Gaussian filter is applied twice, first in the X direction and then in the Y direction. After the smoothing, the spatio-temporal derivatives are computed with 4-point central differences. Finally, the velocity is computed from the spatio-temporal derivatives. A spatial neighborhood of 5 × 5 pixels is used for the velocity calculations, together with a weight matrix identical to that of Barron [5], i.e., with 1-D weights of (0.0625, 0.25, 0.375, 0.25, 0.0625). Fixed noise parameters are used. Velocity estimates for which the highest eigenvalue of $A^T W^2 A$ is less than 0.05 are considered unreliable and removed from the results [5].
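The smoothing and differentiation steps can be sketched as below; the Gaussian sigmas are placeholders and the central-difference coefficients (1, -8, 0, 8, -1)/12 are the usual choice in Barron et al.'s implementation, assumed here since the paper's exact values are not preserved in this copy.

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d, correlate1d

# Standard 4-point central-difference coefficients for f'(x); an assumption
# standing in for the paper's mask.
DIFF = np.array([1.0, -8.0, 0.0, 8.0, -1.0]) / 12.0

def smooth_sequence(frames, sigma_s=1.5, sigma_t=1.5):
    """Separable spatio-temporal Gaussian smoothing of a (t, y, x) stack.
    The sigma values here are placeholders, not the paper's settings."""
    out = gaussian_filter1d(frames.astype(np.float64), sigma_t, axis=0)
    out = gaussian_filter1d(out, sigma_s, axis=1)    # Y direction
    out = gaussian_filter1d(out, sigma_s, axis=2)    # X direction
    return out

def derivatives(smoothed, t):
    """Spatio-temporal derivatives Ix, Iy, It of frame t of a smoothed stack,
    computed with 4-point central differences along each axis."""
    It = correlate1d(smoothed, DIFF, axis=0, mode='nearest')[t]
    Iy = correlate1d(smoothed[t], DIFF, axis=0, mode='nearest')
    Ix = correlate1d(smoothed[t], DIFF, axis=1, mode='nearest')
    return Ix, Iy, It
```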
2.3
Application to Half-Frames in an Interlaced Camera
When using standard video recording systems, optical flow computation poses several practical problems when attempting to analyse high-speed scenarios. A temporal resolution of 25 frames per second is not enough, as [1] already noted. We were particularly interested in a series of experiments in which a video camera installed inside a moving car captures the road scene, acquiring images of 720 × 576 pixels. At 90 km/h (25 m/s) and a frame rate of 25 frames/s, the car advances 1 m between two consecutive frames, which does not guarantee a correct sampling of the variations in the scene's brightness patterns; therefore, the image sequence will present temporal aliasing. Moreover, an interlaced camera poses additional constraints regarding the usefulness of its spatial and temporal resolution: as a full frame does not correspond to a single time sample, this can lead to additional noise when full frames are used. However, we suggest a procedure that uses all the available information in an attempt to optimize the results while minimizing the inconvenience of interlaced images. This procedure consists of rebuilding the video sequence using successive half-frames, as Fig. 1 shows. Current interlaced cameras grab a field (half-frame) every 20 ms and, by merging two interlaced fields (odd and even), a frame is composed, that is, a complete image is generated every 40 ms. Each image is composed of two fields separated by 20 ms. An intermediate image, separated by 20 ms from the original ones, can be composed using two consecutive fields of different frames. Of course, the even
Fig. 1. Reducing the temporal aliasing
and odd lines must not be displaced in the new image. In this way, 50 images per second can be obtained. Figure 2 shows two images of an interlaced video sequence of 720 × 576 pixels that have been processed with the described algorithm. The car on the left is moving faster than the car in the center, and the car on the right is moving slower than the car in the center.
Fig. 2. Video sequence: frames 19 and 29
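A small sketch of the field-recombination procedure described above, assuming NumPy and a list of interlaced frames in which even rows belong to one field and odd rows to the other; the exact field order of a given camera is an assumption and may have to be swapped.

```python
import numpy as np

def rebuild_50hz(frames):
    """Rebuild a double-rate sequence from interlaced frames.

    frames: list of HxW arrays, each composed of two fields captured 20 ms
    apart.  Between every pair of original frames an intermediate image is
    inserted that combines the later field of frame k with the earlier
    field of frame k+1, so even and odd lines keep their row positions.
    """
    out = []
    for k in range(len(frames) - 1):
        out.append(frames[k])
        inter = frames[k].copy()
        # take the even-row field from the next frame, keep the odd-row field
        inter[0::2, :] = frames[k + 1][0::2, :]
        out.append(inter)
    out.append(frames[-1])
    return out
```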
Figure 3 shows, on the left, the optical flow computed using 25 frames per second, with a lot of noise due to the overtaking car, which is moving fast (temporal aliasing) and close to the camera (spatial aliasing). The same figure, on the right, shows the optical flow computed using the new video sequence built from half-frames, with significantly less noise than the one on the left.
3 Parallelization of Optical Flow Computing
The execution times of the different tasks of the sequential algorithm were measured to obtain an estimate of their relative weights. These measurements were taken on a workstation with an Intel Xeon 2.4 GHz and 1 GB of main memory; the important data are not the absolute times but their relationships among the different tasks.
Fig. 3. Optical flow computed for frame 24; left: using 25 Hz, right: using 50 Hz
3.1 Parallel Algorithm
The parallelization of the sequential algorithm is explained as follows; the execution times indicated are those spent on each image of the video sequence. The temporal smoothing, in t, is slower than the others because it works with a high number of images; moreover, it has to read them from disk (12 ms). The spatial smoothing in X takes 8 ms. The spatial smoothing in Y takes 7 ms; the difference is probably because the image is by then in cache memory. Computation of the partial derivatives takes 10 ms. Computation of the velocity of each pixel and writing the results to disk takes 130 ms, more than triple the time spent by the rest of the tasks. Unlike in [3], the images have not been divided, to avoid introducing unnecessary overheads: in that case they would have to be divided, then processed, and finally the results regrouped; moreover, possible boundary effects would have to be taken into account. To structure the parallelization, the first four tasks are connected as a pipeline because they need the data of several images to work properly. The last task only needs a single image and is effectively independent; the fourth task sends derivatives of complete images to different copies of task five in a circular way. Although an 8-node cluster has been used for the implementation, the scheme followed is flexible enough to be adapted to different situations:

Four nodes. The first one executes all the tasks except computing the velocity of the pixels (37 ms). The rest of the nodes compute the velocities and, when they finish with an image, they start with the next one (130/3 = 43 ms per node). One image would be processed every 43 ms (maximum of 37 and 43).

Eight nodes. The first node computes the temporal smoothing and the spatial smoothing for the X co-ordinate (12+8 = 20 ms). The second one computes the spatial smoothing for the Y co-ordinate and the partial derivatives (7+10 = 17 ms). The rest of the nodes compute the velocities (130/6 = 21 ms). An image is processed every 21 ms (maximum of 20, 17 and 21).

Sixteen nodes. The first four nodes are dedicated to the first four tasks (12, 8, 7 and 10 ms respectively). The rest of the nodes compute the velocities (130/12 = 11 ms). An image would be processed every 12 ms (maximum of 12, 8, 7, 10 and 11).
In the three cases, the communication time has to be added. This time depends on the network (Gigabit Ethernet, Myrinet, etc.), but in every case it has to be taken into account and it will take several milliseconds. Alternatively, the same scheme could be used on a shared-memory four-processor machine, with the tasks distributed in the same way as with a four-node cluster. With more than 16 nodes there are not enough tasks to distribute; to obtain a higher degree of parallelism the images would have to be divided, as [3] proposes.
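The throughput figures quoted above follow from taking the slowest pipeline stage. A tiny sketch of that calculation, using the per-task times measured in the text (communication time is ignored here):

```python
# Per-image task times in ms, as measured on the Xeon 2.4 GHz workstation.
T_READ_TSMOOTH = 12   # temporal smoothing + disk read
T_XSMOOTH      = 8
T_YSMOOTH      = 7
T_DERIV        = 10
T_VELOCITY     = 130  # velocity computation + writing results

def pipeline_period(stage_times_ms):
    """A pipeline processes one image per period of its slowest stage."""
    return max(stage_times_ms)

# Eight-node configuration: node 1 does temporal + x smoothing, node 2 does
# y smoothing + derivatives, and six nodes share the velocity computation.
eight_nodes = [T_READ_TSMOOTH + T_XSMOOTH,
               T_YSMOOTH + T_DERIV,
               T_VELOCITY / 6]
print(pipeline_period(eight_nodes))   # about 21.7 ms per image
```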
3.2 Cluster Architecture
A cluster with 8 biprocessor nodes (2.4 GHz, 1 GB RAM) running Linux (Debian with kernel 2.4.21) and openMosix has been used. The nodes are connected through a Gigabit Ethernet switch. This distributed-memory architecture was chosen because it is inexpensive, easy to configure and widely available.
3.3 Implementation
The tasks of the previously described algorithm have been assigned to the different nodes of the cluster. For communications, we used LAM/MPI version 6.5.8, the open-source implementation of the Message Passing Interface (MPI) standard from Indiana University. Non-blocking messages were used, so that computation and communication are overlapped. Moreover, the use of persistent messages avoids the continuous creation and destruction of the data structures used by the messages. This is possible because the communication scheme is always the same: the information that travels between two given nodes always has the same structure and the same size, so the message backbone can be reused. Regarding the non-blocking messages, a node, while processing image i, has already started a non-blocking send to transfer the results of processing the previous image i-1 and has also started a non-blocking receive to simultaneously gather the next image i+1. This allows each node to send, receive and compute simultaneously. Figure 4 shows the distribution of the tasks among the nodes, which is as follows (a communication sketch is given after the task list).

Node 1. Executes the following tasks: reads the images of the video sequence from disk; executes the temporal smoothing (the current image, the twelve previous and the twelve next ones are used); executes the spatial smoothing for the x co-ordinate; sends to node 2 the image smoothed in t and x.

Node 2. Executes the following tasks: receives the image from node 1; executes the spatial smoothing for the y co-ordinate; computes the partial derivative in t of the image (five images are used: the current one, the two previous and the two next ones, so if image i is received, the derivative in t of image i-2 is computed);
Fig. 4. Task distribution
computes the partial derivatives in x and y of the image; sends the computed derivatives It, Ix and Iy to the next node (from 3 to 8) cyclically (when it reaches node 8, it starts again at node 3).

Rest of the nodes. Execute the following tasks: receive the partial derivatives in t, x and y of the image (It, Ix and Iy); using the derivatives, compute the velocity of each pixel as (vx, vy); write the computed velocities to disk.
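The original system used LAM/MPI directly on the cluster nodes; the following sketch uses mpi4py and NumPy (an assumption, not the authors' implementation; ranks, tags, image sizes and the derivative stub are hypothetical) only to illustrate the persistent, non-blocking pattern described above, in which a node receives the next image and sends the previous results while computing on the current one.

```python
import numpy as np
from mpi4py import MPI

H, W, N_IMAGES = 576, 720, 100            # hypothetical sizes

def compute_derivatives(img):
    """Placeholder for the real 4-point central-difference derivatives."""
    gy, gx = np.gradient(img)
    return np.stack([img, gx, gy])        # stands in for (It, Ix, Iy)

comm = MPI.COMM_WORLD
if comm.Get_rank() == 1:                  # e.g. the node computing derivatives
    recv_buf = np.zeros((H, W), dtype=np.float32)
    send_buf = np.zeros((3, H, W), dtype=np.float32)
    # Persistent requests: the message "backbone" is built once and reused,
    # because every transfer between two given nodes has the same structure.
    recv_req = comm.Recv_init([recv_buf, MPI.FLOAT], source=0, tag=11)
    send_req = comm.Send_init([send_buf, MPI.FLOAT], dest=2, tag=22)

    recv_req.Start(); recv_req.Wait()     # get the first image
    current = recv_buf.copy()
    for i in range(1, N_IMAGES):
        recv_req.Start()                  # start gathering the next image ...
        result = compute_derivatives(current)   # ... while processing this one
        if i > 1:
            send_req.Wait()               # earlier results must have left
        send_buf[...] = result
        send_req.Start()                  # ship the results just computed
        recv_req.Wait()
        current = recv_buf.copy()
    send_req.Wait()
```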
3.4 Results
With this parallelization scheme and using the 8-node cluster described above, the computation of the optical flow is achieved at 30 images per second for images of 502 × 288 pixels. For images of 720 × 576 pixels the speed obtained is 10 images per second. Note that the optical flow, in both cases, is computed for every image in the video sequence without skipping any.
4 Conclusions and Future Work
This paper presents a new distributed algorithm for computing the optical flow of a video sequence. This algorithm is based on a balanced distribution of its tasks among the nodes of a cluster of computers. This distribution is flexible and can be adapted to several environments, with shared memory as well as with distributed memory. Moreover, it is easily adaptable to a wide range of node counts: 4, 8, 16, 32 or more. The algorithm has been implemented on a cluster with 8 nodes and a Gigabit Ethernet network, where 30 images per second can be processed at a resolution of 502 × 288 pixels, or 10 images per second at 720 × 576 pixels. This represents a speedup of 6 compared to the sequential version of the algorithm. Also, a method is proposed for increasing the temporal resolution, which is particularly beneficial in complex scenarios with high-speed motion. This additional
algorithm, applied prior to the optical flow computation, effectively doubles the frame rate and reduces the particular motion-aliasing pattern of current interlaced cameras. Taking into account the modest performance obtained in [2] with four processors (7-8 images per second with images of 64 × 64 pixels), or the drawbacks of the simplified algorithms [4], the results obtained with the algorithm proposed here are very satisfactory. The interesting parallelization proposed in [3] cannot be compared because it has not yet been implemented. The performance obtained brings important advantages. Working with longer sequences, larger images (1280 × 1024 pixels or even larger) and higher frequencies is now feasible. Regarding real-time applications, by connecting the video signal directly to one of the nodes of the cluster and digitizing the video sequence on the fly, the current implementation of the algorithm allows online optical flow calculation for images of 502 × 288 pixels at 25 to 30 Hz.
References
1. Lim, S., Gamal, A.: Optical flow estimation using high frame rate sequences. In: Proceedings of the International Conference on Image Processing (ICIP). Volume 2. (2001) 925–928
2. Valentinotti, F., Di Caro, G., Crespi, B.: Real-time parallel computation of disparity and optical flow using phase difference. Machine Vision and Applications 9 (1996) 87–96
3. Kohlberger, T., Schnörr, C., Bruhn, A., Weickert, J.: Domain decomposition for parallel variational optical flow computation. In: Proceedings of the 25th German Conference on Pattern Recognition, Springer LNCS. Volume 2781. (2003) 196–202
4. Zelek, J.: Bayesian real-time optical flow. In: Proceedings of the 15th International Conference on Vision Interface. (2002) 266–273
5. Barron, J., Fleet, D., Beauchemin, S.: Performance of optical flow techniques. International Journal of Computer Vision 12 (1994) 43–77
6. Lucas, B., Kanade, T.: An iterative image registration technique with an application to stereo vision. In: Proceedings of the 7th International Joint Conference on Artificial Intelligence (IJCAI). (1981) 674–679
7. Galvin, B., McCane, B., Novins, K., Mason, D., Mills, S.: Recovering motion fields: An evaluation of eight optical flow algorithms. In: Proceedings of the 9th British Machine Vision Conference. (1998) 195–204
8. Correia, M., Campilho, A., Santos, J., Nunes, L.: Optical flow techniques applied to the calibration of visual perception experiments. In: Proceedings of the 13th Int. Conference on Pattern Recognition, ICPR96. Volume 1. (1996) 498–502
9. Correia, M., Campilho, A.: Real-time implementation of an optical flow algorithm. In: Proceedings of the 16th Int. Conference on Pattern Recognition, ICPR02. Volume IV. (2002) 247–250
10. Simoncelli, E., Adelson, E., Heeger, D.: Probability distributions of optical flow. In: IEEE Conference on Computer Vision and Pattern Recognition. (1991) 310–315
Lipreading Using Recurrent Neural Prediction Model Takuya Tsunekawa, Kazuhiro Hotta, and Haruhisa Takahashi The University of Electro-Communications 1-5-1 Chofugaoka, Chofu-shi, Tokyo 182-8585 Japan {tune,hotta,takahasi}@ice.uec.ac.jp
Abstract. We present lipreading using a recurrent neural prediction model. Lipreading copes with time-series data, like speech recognition. Therefore, many traditional methods use the Hidden Markov Model (HMM) as the classifier for lipreading. However, in recent years a speech recognition method using the Recurrent Neural Prediction Model (RNPM) has been proposed, and good results have been reported. It is expected that RNPM also gives good results for lipreading, because lipreading has similar properties to speech recognition. The effectiveness of the proposed method is confirmed by using 8 words captured from 5 persons. In addition, a comparison with HMM is performed, and it is confirmed that comparable performance is obtained.
1 Introduction
Lipreading by computer is to classify words using only image sequences around the speaker's mouth. Since visual information is not degraded in acoustically noisy environments, it is helpful for improving the performance of speech recognition [1,2,3,4]. Furthermore, lipreading is also useful for communication between deaf or hard-of-hearing people and hearing people. Since lipreading copes with time-series data, like speech recognition, robustness to expansion and contraction of time is required. Therefore, many traditional methods use the Hidden Markov Model (HMM) as the classifier for lipreading [1,2,3]. However, in recent years a speech recognition method using the Recurrent Neural Prediction Model (RNPM) has been proposed, and good results have been reported [5]. RNPM prepares a recurrent neural network (RNN) for each category. The RNN of each category is trained to predict the feature vector at time t+1 from the feature vector at time t. The classification is performed by using the prediction error of each RNN. It is expected that RNPM also gives good results for lipreading, because lipreading has similar properties to speech recognition. In this paper, lipreading using RNPM is proposed. To classify the words using only image sequences, the motion information around the speaker's mouth is extracted by using optical flow. By using optical flow, classification that is independent of lip shape and facial hair is expected [4]. However, the optical flows, which are the flow velocities at each point of the image sequence, are redundant. Therefore, Principal Component Analysis (PCA) is used to reduce the redundancy and extract the essential features. The obtained features are
fed into the RNPM. In this paper, the Very Simple Recurrent Network Plus (VSRN+) is used as the RNN (predictor) of each category. Since VSRN+ gives superior performance to the Jordan and Elman networks [5], it is expected that the proposed method gives comparable performance to HMM. The image sequences of 8 words are captured from 5 persons, and the effectiveness of the proposed method is confirmed by using those sequences. In addition, the proposed method is compared with HMM, and it is confirmed that comparable performance is obtained. This paper is structured as follows. Section 2 describes the feature extraction for lipreading. The RNPM based on VSRN+ is explained in Section 3. Section 4 shows the experimental results. Conclusions and future works are described in Section 5.
2 Feature Extraction
In this paper, optical flow and Principal Component Analysis (PCA) are used to extract the features for lipreading. To classify the words using only image sequences, the motion information around the speaker's mouth is extracted by computing optical flow. By using optical flow, classification that is independent of lip shape and facial hair is expected [3,4]. However, the optical flows obtained at each point of the input image are redundant. Therefore, PCA is used to reduce the redundancy and extract the essential features. Section 2.1 explains optical flow. PCA is described in Section 2.2.
2.1 Optical Flow
Optical flow is defined as the distribution of apparent velocities of brightness pattern movements. In this paper, optical flow is computed with a gradient-based algorithm [6]. Let the image brightness at the point (x, y) in the image plane at time t be denoted by I(x, y, t). By assuming that the brightness of each point is constant during dt, the optical flow constraint equation is defined by

I_x u + I_y v + I_t = 0,    (1)

where (u, v) is the flow velocity and I_x, I_y, I_t are the partial derivatives of I with respect to x, y and t. In practice, equation (1) does not hold exactly because of noise. In addition, the number of unknown parameters is larger than the number of equations. By adding the assumption that flow velocities throughout a local region R are constant, we seek the flow velocity which minimizes

E(u, v) = Σ_{(x,y)∈R} (I_x u + I_y v + I_t)².    (2)
By computing the partial derivatives of equation (2) with respect to u and v and setting them to zero, the normal equations

(Σ I_x²) u + (Σ I_x I_y) v = −Σ I_x I_t,
(Σ I_x I_y) u + (Σ I_y²) v = −Σ I_y I_t

are obtained. Therefore the flow velocity (u, v) is obtained by solving this 2 × 2 linear system.
In the following experiment, the local region is defined as 3 × 3 pixels. Optical flow is computed at intervals of 1 pixel.
2.2 Principal Component Analysis (PCA)
PCA is one of the methods to reduce the dimension of feature vectors [7]. It reduces the dimension by projecting onto the subspace in which the scatter is largest. The covariance matrix Σ is computed from the feature vectors, and the projection matrix A is obtained by solving the following eigenvalue problem:

Σ A = A Λ,

where Λ is a diagonal matrix of eigenvalues. The jth column of A is the eigenvector corresponding to the jth largest eigenvalue, and each eigenvalue equals the variance of the corresponding principal component. Only the eigenvectors with the largest eigenvalues are retained in the projection matrix in order to reduce the redundancy.
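A short sketch of this reduction, assuming NumPy; the rule of keeping enough components to cover roughly 60% of the variance mirrors the setting reported later in the experiments, and the function name is illustrative.

```python
import numpy as np

def pca_projection(X, var_ratio=0.6):
    """Project data onto the leading principal components.

    X: (n_samples, n_features) matrix of optical-flow feature vectors.
    Components are kept until they explain at least var_ratio of the
    total variance (about 60% in the experiments).
    """
    Xc = X - X.mean(axis=0)
    cov = np.cov(Xc, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)          # ascending order
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    k = np.searchsorted(np.cumsum(eigvals) / eigvals.sum(), var_ratio) + 1
    A = eigvecs[:, :k]                              # projection matrix
    return Xc @ A, A
```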
3 Proposed Classifier for Lipreading
In the Recurrent Neural Prediction Model (RNPM), a recurrent neural network (RNN), which is robust to expansion and contraction of time, is prepared for each category and used as the predictor. The classification is performed by using the prediction error of each RNN. In this paper, the Very Simple Recurrent Network Plus (VSRN+) is used as the RNN. In Section 3.1, RNPM is described. Section 3.2 explains VSRN+; the learning algorithm of VSRN+ is also described.
3.1 Recurrent Neural Prediction Model (RNPM)
RNPM prepares an RNN for each category. The RNN is trained to predict the feature vector at time t+1 from the feature vector at time t. To classify an input sequence, the sum of squared differences between the actual feature vectors and the predicted feature vectors is computed. The feature vectors extracted from the input sequence are denoted by x_1, ..., x_T. When the feature vector predicted by the RNN of category c from x_{t−1} is denoted by x̂_t^c, the sum of squared differences is defined by

E_c = Σ_{t=2}^{T} || x_t − x̂_t^c ||²,
where E_c is the prediction error of category c. Figure 1 shows the classification by the RNPM. The test pattern sequence is fed into each RNN, and the prediction error of each RNN is computed. The test pattern sequence is classified into the category giving the lowest prediction error. Namely, the classification is performed by

c* = arg min_c E_c.
Fig. 1. Classification by using RNPM.
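A sketch of the RNPM decision rule just described, assuming NumPy; `predictors` stands for the per-category trained recurrent predictors (a hypothetical interface: callables mapping x_t to a prediction of x_{t+1}).

```python
import numpy as np

def rnpm_classify(sequence, predictors):
    """Classify a feature-vector sequence with the RNPM rule.

    sequence:   array of shape (T, d) with the PCA feature vectors.
    predictors: dict {category: predict_fn}, one trained recurrent
                predictor per category (hypothetical interface).
    The sequence is assigned to the category with the smallest summed
    squared prediction error.
    """
    errors = {}
    for category, predict in predictors.items():
        err = 0.0
        for t in range(len(sequence) - 1):
            pred = predict(sequence[t])
            err += float(np.sum((sequence[t + 1] - pred) ** 2))
        errors[category] = err
    return min(errors, key=errors.get), errors
```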
3.2 Very Simple Recurrent Network (VSRN+)
In this paper, VSRN+ is used as the RNN. Uchiyama et al. [5] reported that VSRN+ has superior performance to the Jordan [8] and Elman [9] recurrent networks. The architecture of VSRN+ is shown in Figure 2. The hidden and output layers each have a context layer. Each hidden neuron is connected with all input neurons and with its own context neuron. Similarly, each output neuron is connected with all hidden neurons and with its own context neuron. The property of VSRN+ is to use the recurrent values from the hidden and output neurons at all previous times with a delay rate λ. The value o_i(t) of the ith neuron of the output layer at time t is obtained by

o_i(t) = f( Σ_{j=1}^{N_h} w_{ij} h_j(t) + β c_i(t) ),   c_i(t) = λ c_i(t−1) + o_i(t−1),

where h_j(t) is the output from the jth hidden neuron at time t, w_{ij} is the weight between the jth hidden neuron and the ith output neuron, c_i(t) is the recurrent value from the ith context neuron at time t, and N_h is the number of hidden neurons. The activation function f is the sigmoid. Similarly, the output from each hidden neuron is obtained. In this paper, it is assumed that the weights β to the context neurons of the hidden and output layers are constant. By this assumption, VSRN+ can be regarded as a feedforward neural network, and therefore the backpropagation algorithm is used to train it. In the following experiment, the weights of each VSRN+ are initialized randomly and estimated by backpropagation.
Fig. 2. The architecture of VSRN+.
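The exact VSRN+ equations are given in [5]; the following NumPy sketch only illustrates the general structure implied by the description above (context units accumulating previous activations with a delay rate), and the specific update rule, weights and class interface are assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class VSRNPlusSketch:
    """Structural sketch of VSRN+ (assumed form, not the reference model).

    The hidden and output layers each keep one context unit per neuron
    that accumulates the neuron's previous activations with delay rate lam.
    """
    def __init__(self, n_in, n_hidden, n_out, lam=0.75, seed=None):
        rng = np.random.default_rng(seed)
        self.Wh = rng.normal(scale=0.1, size=(n_hidden, n_in))
        self.Wo = rng.normal(scale=0.1, size=(n_out, n_hidden))
        self.lam = lam
        self.ch = np.zeros(n_hidden)      # context of the hidden layer
        self.co = np.zeros(n_out)         # context of the output layer

    def step(self, x):
        h = sigmoid(self.Wh @ x + self.ch)
        o = sigmoid(self.Wo @ h + self.co)
        # each context neuron keeps a decayed sum of all previous activations
        self.ch = self.lam * self.ch + h
        self.co = self.lam * self.co + o
        return o
```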
4 Experiment
The image sequences of 8 words (aka, ao, ki, kuro, shiro, cha, midori, murasaki), which are color names in Japanese, are captured from 5 persons under the same environment. The image sequences are captured at 30 frames per second. The number of image sequences is 400 (5 speakers × 8 words × 10 sequences). The region around the speaker's mouth is cropped manually. Examples of the cropped image sequences are shown in Figure 3. The upper sequence is "aka" and the lower sequence is "ki". The length of one image sequence is between 40 and 180 frames. The size of each image is 50 × 48 pixels. In the following experiment, these image sequences are divided into 3 sets. The first set consists of 200 sequences (5 speakers × 8 words × 5 sequences) and is used for training the classifier. The second set, which consists of 80 sequences (5 speakers × 8 words × 2 sequences), is used as the validation set; the validation set is used for avoiding overfitting. The third set is used for evaluating the performance. This division of the database is performed randomly 10 times, and the generalization ability is evaluated by the average over the 10 trials. The features are extracted from these image sequences by using optical flow and PCA. We investigate the trajectory of each category in the principal component space. Figure 4 shows the trajectories of "aka" and "ki" along the first and second principal component axes; the differences between the trajectories can be seen. In the following experiments, the dimension of the principal component space is set to 20 so as to cover about 60% of the variance in the training data.
4.1 Experimental Result and Analysis
RNPM has two parameters: the delay rate λ and the number of hidden neurons. First, λ is fixed to 0.8, which gives good performance in the preliminary
Fig. 3. Examples of the cropped image sequences. The upper sequence is "aka" and the lower sequence is "ki".
Fig. 4. The trajectories of "aka" and "ki" along the first and second principal component axes.
experiment. The performance is evaluated by changing the number of hidden neurons; the best performance is obtained when the number of hidden neurons is 30. Next, the number of hidden neurons is fixed to 30 and the performance is evaluated by changing λ. The results are shown in Table 1. From Table 1, we understand that the proposed method can recognize the words by using only image sequences. The best performance is obtained when λ is 0.75. The effectiveness of the proposed method is confirmed. Next, the proposed method is compared with left-to-right HMMs [10,11] which use Gaussian mixture models. The emission probabilities of the HMM are initialized by using k-means, and the transition probabilities are initialized randomly. These parameters are estimated by using the Baum-Welch algorithm, and the classification is performed by using the Viterbi algorithm. The HMM also has two parameters: the number of states Q and the number of
Gaussians of the emission probability. First, the number of Gaussians is fixed to 1 and Q is changed; the best performance is obtained when Q is 4. Next, the performance is evaluated by changing the number of Gaussians with Q fixed to 4. The results are shown in Table 2, and the best performance is obtained with 5 Gaussians. From Tables 1 and 2, we understand that comparable performance is obtained.
5 Conclusions and Future Works
We presented lipreading using RNPM. The effectiveness of the proposed method is confirmed by using 8 words captured from 5 persons. In addition, a comparison with HMM is performed, and comparable performance is obtained. In this paper, in computing the prediction error of each RNN, the same weight is assigned to the features of large and small motions. It is expected that the performance can be improved by changing the weight of the features according to the magnitude of motion when computing the prediction error. In addition, we can use weight decay [7] to improve the generalization ability. Since the proposed method is a
general framework, it can be applied to other recognition tasks, e.g., gesture recognition.
References
1. A. Rogozan and P. Deléglise, Adaptive fusion of acoustic and visual sources for automatic speech recognition, Speech Communication, vol. 26, no. 1-2, pp. 149-161, 1998.
2. G. Potamianos, C. Neti, G. Iyengar and E. Helmuth, Large-Vocabulary Audio-Visual Speech Recognition by Machines and Humans, Proc. Eurospeech, 2001.
3. K. Iwano, S. Tamura and S. Furui, Bimodal Speech Recognition Using Lip Movement Measured by Optical-Flow Analysis, Proceedings International Workshop on Hands-Free Speech Communication, pp. 187-190, 2001.
4. K. Mase and A. Pentland, Lipreading by optical flow, Systems and Computers in Japan, vol. 22, no. 6, pp. 67-76, 1991.
5. T. Uchiyama and H. Takahashi, Speech Recognition Using Recurrent Neural Prediction Model, IEICE Transactions on Information and Systems D-II, vol. J83-D-II, no. 2, pp. 776-783, 2000 (in Japanese).
6. B.D. Lucas and T. Kanade, An Iterative Image Registration Technique with an Application to Stereo Vision, Proceedings of Imaging Understanding Workshop, pp. 121-130, 1981.
7. R.O. Duda, P.E. Hart and D.G. Stork, Pattern Classification, Second Edition, John Wiley & Sons, Inc., 2001.
8. M. Jordan, Serial Order: A Parallel Distributed Processing Approach, Technical Report ICS, no. 8604, 1986.
9. J.L. Elman, Finding structure in time, Cognitive Science, vol. 14, pp. 179-211, 1990.
10. L. Rabiner, A tutorial on Hidden Markov Models and selected applications in speech recognition, Proc. of the IEEE, vol. 77, no. 2, pp. 257-286, 1989.
11. X.D. Huang, Y. Ariki and M.A. Jack, Hidden Markov Models for Speech Recognition, Edinburgh Univ. Press, 1990.
Multi-model Adaptive Estimation for Nonuniformity Correction of Infrared Image Sequences* Jorge E. Pezoa and Sergio N. Torres Department of Electrical Engineering, University of Concepción. Casilla 160-C, Concepción, Chile. {jpezoa,storres}@die.udec.cl http://nuc.die.udec.cl
Abstract. This paper presents a multiple-model parallel processing technique to adaptively estimate the nonuniformity parameters of infrared image sequences. The approach is based both on an optimal recursive estimation using a fast form of the Kalman filter, and on a solution for the uncertainties in the system model obtained by running a bank of those estimators in parallel. The residual errors of these estimators are used as hypotheses to test and assign the conditional probabilities of each model in the bank of Information-form Kalman filters. The conditional probabilities are used to calculate weighting factors for each estimation and to compute the final system state estimation as a weighted sum. The weighting factors are updated recursively from one sequence of infrared images to another, providing the estimator with a way to follow the dynamics of the scene recorded by the infrared imaging system. The ability of the scheme to adaptively compensate for nonuniformity in infrared imagery is demonstrated by using real infrared image sequences. Topic: Image and Video Processing and Analysis. Keywords: Image Sequence Processing, Focal Plane Arrays, Nonuniformity Correction, Kalman Filtering.
1 Introduction
Infrared (IR) imaging systems utilized in scientific, industrial, and military applications employ an IR sensor to digitise the information. Due to its compactness, cost-effective production, and high performance, the most widely used integrated technology in IR sensors is the Focal Plane Array (FPA) [1]. An IR-FPA is a die composed of a group of photodetectors placed in a plane forming a matrix of X × Y pixels, which gives the sensor the ability to collect the IR information. It is well known that nonuniformity noise in IR imaging sensors, which is due to *
This work was partially supported by the ‘Fondo Nacional de Ciencia y Tecnología’ FONDECYT of the Chilean government, project number 1020433 and by Grant Milenio ICM P02-049. The authors wish to thank Ernest E. Armstrong (OptiMetrics Inc., USA) for collecting the data, and the United States Air Force Research Laboratory, Ohio, USA.
pixel-to-pixel variation in the detectors' responses, can considerably degrade the quality of IR images, since it results in a fixed-pattern noise (FPN) that is superimposed on the true image [2]. What makes matters worse is that the nonuniformity slowly varies over time and, depending on the technology used, this drift can take from minutes to hours [2]. In order to solve this problem, several scene-based nonuniformity correction (NUC) techniques have been developed. Scene-based techniques perform the NUC using only the video sequences that are being imaged [2,3,4], not requiring any kind of laboratory calibration. In particular, our group has been active in the development of novel scene-based algorithms for NUC based on statistical estimation theory. In [3,4], we developed a Gauss-Markov model to capture the slow variation in the FPN and utilized the model to adaptively estimate the nonuniformity in the infrared video sequence using a Kalman filter. In that development, we assumed a known linear state-space system model; but in practical situations there exist uncertainties in the parameters of such a model. In this paper, a multi-model adaptive estimation (MMAE) technique to compensate for NUC in IR video sequences, based on the mentioned Kalman filters and capable of reducing modelling uncertainties, is developed. The algorithm employs the parallel processing technique for adaptive Kalman filtering suggested in [5] and is computationally improved by the use of the Information form of the Kalman (IFK) filter [4]. A bank of IFK filters is used to compensate for NUC, and the residuals of these estimators are used as hypotheses to test and assign the conditional probabilities of each model in the bank of IFK filters. Then, the conditional probabilities are used to compute weighting factors for each estimation and to compute the final system state estimation as a weighted sum. Additionally, the weighting factors are updated recursively from one sequence of IR video to another, providing the estimator with a way to follow the dynamics of the scene recorded by the IR imaging system. This paper is organized as follows. In Section 2 the model of the system is presented and the derivation of the multi-model algorithm is developed. In Section 3 the adaptive filtering NUC technique is tested with five video sequences of real raw IR data. In Section 4 the conclusions of the paper are summarized.
2 Adaptive Multi-model Estimation of Nonuniformity for Infrared Video Sequences
The model for each pixel of the IR-FPA is a linear relationship between the input irradiance and the detector response [2,3,4]. Further, for a single detector in the FPA, vectors of readout data are considered corresponding to a sequence of videos of frames for which no significant drift in the nonuniformity parameters (the gain and the bias) occurs within each video. For the kth video, the linear input-output relation of the ij-th detector in the nth frame is approximated by [1]:

Y_k(n) = A_k X_k(n) + B_k + V_k(n),    (1)
where A_k and B_k are the ij-th detector's gain and bias, respectively, at the kth video of frames. X_k(n) represents the average number of photons that are detected by the ij-th detector during the integration time associated with the nth frame of the kth video. V_k(n) is the additive readout (temporal) noise associated with the ij-th detector for the nth frame during the kth video. In addition, the vector Y_k is the vector of readout values for the ij-th element of the FPA associated with the kth video. For simplicity of notation, the pixel superscripts ij will be omitted with the understanding that all operations are performed on a pixel-by-pixel basis. According to [3,4], the slow drift in the nonuniformity between videos of frames is modeled by a Gauss-Markov process for the gain and the bias of each pixel on the FPA. This is:

S_{k+1} = Φ_k S_k + G W_k.    (2)
Here S_k = (A_k, B_k)^T is the state vector comprising the gain and the bias at the kth video-time, and Φ_k is the 2 × 2 diagonal transition matrix between the states at k and k+1, with its diagonal elements being the parameters α_k and β_k that represent, respectively, the level of drift in the gain and bias between consecutive videos. G is a 2 × 2 noise identity matrix that relates the driving (or process) noise vector W_k to the state vector S_k. The components of W_k are the random driving noises for the gain and the bias, respectively, at the kth video-time. A key requirement imposed on (2) is that the state vector must be a stationary random process since, in practice, the drift in the gain and bias randomly changes the FPN but should not alter its severity. All other assumptions are shown and justified in detail elsewhere [3]. The observation model for a given video of frames is an extension of the linear model (1) and it can be cast as

Y_k = H_k S_k + V_k,    (3)
where H_k is the observation matrix, in which the first column contains the input per frame and the second column is a vector of ones, and V_k is the additive temporal noise vector. The main assumption in the observation model (3) is that the input in each video and in any detector is an independent sequence of uniformly-distributed random variables in a range that is common to all detectors in each video of frames [3]. Using the previous model and its assumptions, Torres et al. developed a Kalman filter capable of compensating for NUC [3], and recently they presented the IFK filter and demonstrated that it is computationally more efficient than its predecessor in estimating the gain and offset [4]. It is well known that for real IR video sequences there exist uncertainties in the parameters of the model (2), (3). It is also known that good knowledge of the system parameters and noise covariances represents an enormous benefit for the estimation of the states. Fortunately, several methods have been developed to deal with this kind of problem. The MMAE is one of
them, and it consists of the use of a parallel processing technique for adaptively estimating the state variables of the system under study, using multiple parameter sets for the system and a bank of Kalman filters [3,4].
2.1 The Multiple Model Adaptive Estimator
The MMAE estimator for NUC in IR video sequences is developed assuming that the system model (2), (3) contains unknown and unvarying parameters. These parameters are represented by a discrete random vector θ defined over a finite sample space {θ_1, ..., θ_N} with known or assumed a priori probabilities p(θ_i). Further, to develop the MMAE estimator, at the video time k it is necessary to find the form of a minimum-variance-of-error estimator for the system state S_k based on the noisy measurement set Y_k = {y_1, ..., y_k} and on the known set {θ_1, ..., θ_N}. Now, it can be demonstrated [5] that a minimum variance of error estimation can be formed according to:

Ŝ_k = Σ_{i=1}^{N} Ŝ_k(θ_i) p(θ_i | Y_k),    (4)
where Y_k is a vector containing all the video sequences (or measurement sets) up to k, and p(θ_i | Y_k) is the probability of θ_i given Y_k. It is well known that the minimum variance estimate can be written as a conditional expectation; then Ŝ_k(θ_i) can be obtained from the Kalman filter algorithm developed in [3,4] under the assumption that θ = θ_i. Thus, the minimum variance of error estimate described by (4) is a weighted sum of estimates from the N parallel Kalman filters designed under the assumptions θ = θ_1, ..., θ_N. Now, we apply Bayes' rule to p(θ_i | Y_k), yielding the relationships [5]:

p(θ_i | Y_k) = p(y_k | Y_{k-1}, θ_i) p(θ_i | Y_{k-1}) / Σ_{j=1}^{N} p(y_k | Y_{k-1}, θ_j) p(θ_j | Y_{k-1}).    (5)
Then, it can be seen in (5) that the denominator is just a normalizing constant in the recursive equation obtained for the weighting factors used in (4). Also, a closer examination of equation (5) shows that the calculation of p(y_k | Y_{k-1}, θ_i) is crucial for the algorithm's development. Fortunately, this term is readily implemented for Gaussian signal models in terms of its conditional mean and covariance matrix [5]. In this case, p(y_k | Y_{k-1}, θ_i) is Gaussian with mean ŷ_k(θ_i) (i.e., the a priori estimate of y_k based on the system model) and covariance Ω_k(θ_i).
Now, expanding the conditional mean and covariance matrix in terms of the known system quantities, the term p(y_k | Y_{k-1}, θ_i) can be calculated as follows:

p(y_k | Y_{k-1}, θ_i) = (2π)^{-l/2} |Ω_k(θ_i)|^{-1/2} exp{ -½ r_k(θ_i)^T Ω_k(θ_i)^{-1} r_k(θ_i) },    (6)
r_k(θ_i) = y_k − H_k Ŝ_{k|k-1}(θ_i),    (7)
Ω_k(θ_i) = H_k P_{k|k-1}(θ_i) H_k^T + R(θ_i),    (8)
where P_{k|k-1}(θ_i) is the a priori error covariance matrix and R(θ_i) is a matrix, for the ith model, containing the cross covariance function of the noise plus a constant term obtained from the constraint that the state vector must be a stationary random process [3,4]. Therefore, the MMAE algorithm considers as a first step the computation of the model probability using equations (6) to (8). The next step is to calculate the weighting factors given in equation (5). Lastly, the algorithm computes the final weighted estimation formulated by equation (4). Since no measurements have been taken at k = 0, the initial values of the weighting factors are assumed to be the a priori probabilities p(θ_i).

Discussion. The MMAE, formulated by equations (4) to (8), fits perfectly with the IFK filter, because the quantities required to compute the probabilities and the weighting factors are available for all the models from the normal operation of the N filters. On the other hand, one would hope that the use of the MMAE in a particular situation would imply that if the true value of θ were, say, θ_1, then p(θ_1 | Y_k) → 1 and p(θ_j | Y_k) → 0 for j ≠ 1 as k grows. Indeed, results of this type hold, and the convergence requirements for MMAE algorithms are well established in the literature [5]. The fundamental condition to be satisfied is the ergodicity or the asymptotic wide-sense stationarity of the residuals. It can be demonstrated that these residuals are wide-sense stationary and, further, that their expected values and autocorrelations remain constant, so the convergence of the approach is assured [5]. Finally, it can be seen in equation (5) that, if one of the N filters' conditional probabilities becomes zero, it will remain zero for all time. This effect causes the MMAE to ignore information from that particular filter. To avoid this problem, a lower probability bound is set for each individual filter, and if this situation occurs all computed probabilities are rescaled.
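A condensed sketch of one MMAE step as described above, assuming NumPy; the interface to the bank of IFK filters (predicted measurement, innovation covariance, per-model state estimate) and the flooring value are assumptions, and a log-domain likelihood would be preferable in practice.

```python
import numpy as np

def mmae_step(y, preds, covs, states, weights, floor=1e-3):
    """One MMAE update over N candidate models.

    y       : measurement vector for the current video.
    preds   : predicted measurements, one per model (from each IFK filter).
    covs    : innovation covariance matrices, one per model.
    states  : per-model state estimates (gain, bias).
    weights : current model probabilities p(theta_i | Y_{k-1}).
    floor   : lower probability bound so no model is discarded forever.
    Returns the updated weights and the combined (weighted) state estimate.
    """
    like = np.empty(len(preds))
    for i, (yhat, S) in enumerate(zip(preds, covs)):
        r = y - yhat                                  # residual of model i
        _, logdet = np.linalg.slogdet(2 * np.pi * S)
        like[i] = np.exp(-0.5 * (r @ np.linalg.solve(S, r) + logdet))
    w = like * np.asarray(weights)
    w = w / w.sum()                                   # Bayes normalization
    w = np.maximum(w, floor)                          # keep every model alive
    w = w / w.sum()                                   # rescale after flooring
    x_hat = sum(wi * np.asarray(si) for wi, si in zip(w, states))
    return w, x_hat
```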
3 Adaptive Estimation for Nonuniformity Correction of Real Infrared Image Sequences
In this section the MMAE algorithm is applied to five videos of terrestrial midwave IR imagery that were imaged using a 128 × 128 InSb FPA cooled camera (Amber Model AE-4128). The IR videos were collected at different hours of the same day (6:30 AM, 8 AM, 9:30 AM, 11 AM and 1 PM), each video originally contained 4200 frames captured at a rate of 30 fps, and each pixel was quantized in 16-bit integers.
For brevity, three models, and therefore three parallel IFK filters, form the base of the MMAE for evaluating the uncertain quantities in the model matrices. NUC was performed per pixel by subtracting the estimated bias from the video sequences and dividing the outcome by the estimated gain. Also, the performance of the whole system for NUC was evaluated by means of the reduction in the roughness parameter [3], and the convergence of the algorithm was assessed by means of the weighting factors at each video time. Lastly, the lower probability bounds were established at a small value because the heuristics of the process show, at that value, a good trade-off with the response of the MMAE. The procedure used to select the set of appropriate values for the system parameters is described in [4]. Also, in this paper, a known range for the average IR irradiance collected by each detector is assumed [4]. Selecting the drifting parameters between video sequences and the initial mean values for the gain and the bias per pixel is the main task of the proposed MMAE; thus, these quantities and their combinations form the unknown parameter vector and the discrete sample space, respectively. To evaluate the effectiveness of the MMAE in selecting the best possible values, it is necessary to quantify the uncertain parameters, which are the mean value and the standard deviation of the gain and the bias [3,4]. Our experience with the system leads us to assume the following: for the first model a mean value for the gain (bias) of 1.3 (-4000), for the second model a mean value of 1 (0), and for the third model a mean value of 0.7 (4000). Also, the initial standard deviation for the gain (bias) was 2% (5%) for all the models, and the drifting factors were set to 0.93 (0.93) for the three cases, representing a low drift in these parameters. The initial a priori probabilities were set equal for all models. Lastly, the videos were placed in temporal order.
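A per-pixel sketch of how the estimated parameters are applied, assuming NumPy; the roughness index shown here is one common definition (total variation relative to the image's L1 norm) and is only a stand-in for the roughness parameter of [3].

```python
import numpy as np

def apply_nuc(frame, gain_est, bias_est):
    """Correct one raw IR frame with the per-pixel MMAE estimates."""
    return (frame - bias_est) / gain_est

def roughness(img):
    """Stand-in roughness index: smaller values mean less fixed-pattern noise.

    One common definition: L1 norm of horizontal plus vertical differences
    divided by the L1 norm of the image; the exact parameter of [3] may differ.
    """
    dh = np.abs(np.diff(img, axis=1)).sum()
    dv = np.abs(np.diff(img, axis=0)).sum()
    return (dh + dv) / np.abs(img).sum()
```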
Fig. 1. Examples of the results for the first case. The left image shows a real uncorrected IR frame and the right image is the corrected version generated by the MMAE.
The second matter to analyze is the main goal of the paper: to determine the level of drift in the gain and the bias between given IR video sequences. For this purpose, the video sequences were not sorted in temporal order, to avoid any kind of relationship between them; in fact, the videos were ordered as follows: 11 AM, 8 AM, 1 PM, 6:30 AM and 9:30 AM. The results obtained in the previous test for the mean gain (bias) of 1.3 (-4000) and a standard deviation of 2% (5%) are used, where appropriate, as initial values. Now, the drifting parameters were set to 0.55 for the first model, 0.95 for the second, and 0.75 for the last one [4]. Fig. 3 shows the mean value of the weighting factors for the evaluated situation. These values indicate that there exists a moderate tendency to drift in the nonuniformity parameters. Indeed, it can be observed that the mean values of the estimated gain at two of the video sequences are 1.22 and 1.45, representing a variation of 16%, whereas the bias varies by 28% between the same video sequences. Also, note in Fig. 3 that, in this case, the algorithm slowly tends to one of the models, in accordance with the convergence result expressed in the discussion. Fig. 2 shows a real and a corrected frame from the fourth video sequence (6:30 AM). The NUC obtained for this IR sequence was somewhat satisfactory, but ghosting artifacts show up over the corrected images [3,4]. However, we have observed that such ghosting artifacts can be reduced by using more frames [3,4].
4 Conclusions
In this paper we have developed a multi-model adaptive estimator for nonuniformity correction in infrared video sequences, based on a bank of our previous Kalman filters and on a recursive algorithm that statistically weights the outputs of the bank of filters and computes the final estimation of the nonuniformity parameters. Our evaluations, using real corrupted infrared video sequences, have shown that the approach effectively performs the NUC of the video sequences, and that it converges to the best set of parameters defined for each model, in accordance with the theoretical convergence commented on in the discussion. A practical
Fig. 2. Examples of the results for the second case. The images show a real uncorrected IR frame and the corrected version generated by the MMAE.
Fig. 3. The evolution of the weighting factors per video-sequence time for each model considered. The markers represent the first, the second, and the third model, respectively.
consideration was made to improve the computational performance of the algorithm: lower probability bounds were established to prevent any one estimate's probability from becoming, and remaining, zero.
References
1. Holst, G.: CCD Arrays, Cameras and Displays. SPIE Opt. Eng. Press, Bellingham (1996).
2. Harris, J., Chiang, Y-M.: Nonuniformity Correction of Infrared Image Sequences Using the Constant-Statistics Constraint. IEEE Trans. on Image Proc. 8 (1999) 1148–1151.
3. Torres, S., Hayat, M.: Kalman filtering for adaptive nonuniformity correction in infrared focal plane arrays. JOSA-A, Opt. Soc. of America 20 (2003) 470–480.
4. Torres, S., Pezoa, J., Hayat, M.: Scene-based Nonuniformity Correction for Focal Plane Arrays Using the Method of the Inverse Covariance Form. OSA App. Opt. Inf. Proc. 42 (2003) 5872–5881.
5. Anderson, B., Moore, J.: Optimal Filtering. Prentice-Hall, New Jersey (1979).
A MRF Based Segmentation Approach to Classification Using Dempster Shafer Fusion for Multisensor Imagery
A. Sarkar1, N. Banerjee2, P. Nair1, A. Banerjee1, S. Brahma2, B. Kartikeyan3, and K.L. Majumder3
1 Department of Mathematics, IIT Kharagpur, 2 Department of Computer Science and Engineering, IIT Kharagpur, 3 Space Application Centre, Ahmedabad, India
Abstract. A technique has been suggested for multisensor data fusion to obtain landcover classification. It handles feature-level fusion with the Dempster-Shafer rule and data-level fusion with a Markov Random Field model based approach for determining the optimal segmentation. Subsequently, segments are validated and the classification accuracy for the test data is evaluated. Two illustrations of data fusion of optical images and a Synthetic Aperture Radar (SAR) image are presented, and accuracy results are compared with those of some recent techniques in the literature for the same image data. Index Terms: Dempster-Shafer Theory, Hotelling's T², Markov Random Field (MRF), Fisher's discriminant.
1 Introduction
We address the problem of landcover classification for multisensor images that are similar in nature. Images acquired over the same site by different sensors are to be analyzed by combining the information from them. The roles of feature-level fusion using the Dempster-Shafer (DS) rule and of data-level fusion in an MRF context have been studied in this work to obtain an optimally segmented image. This segmented image is then labelled with groundtruth classes by a cluster validation scheme to obtain the classified image. Classification accuracy results of the method are evaluated with test set data and compared with those of some recent works in the literature. A number of techniques are available in the literature [2,3,5,7] for analyzing data from different sensors or sources. An extensive review is given in Abidi and Gonzales [1]; a very brief survey is also available in [7]. Among the many approaches for data fusion, the Dempster-Shafer theory of evidence has created a lot of interest, although its isolated pixel-by-pixel use has not shown very encouraging results. A statistical approach with similar sources has been investigated in [5] under the assumption of a multivariate Gaussian distribution incorporating a source reliability factor. This work [5] also demonstrates the use of the mathematical theory of evidence or DS rule for aggregating the recommendations
of the two sources. A methodological framework due to Solberg et al. [7], which considers the important elements of spatial as well as temporal context, is the Markov Random Field model for multisource classification. An interesting work of Bendjebbour et al. [2] demonstrates the use of DS theory in a Markovian context. In that work the MRF is defined over pixel sites, and as such the computation time of the approach is expected to be very high when a large number of groundtruth classes occur in a natural scene, which is usually the case. We attempt this investigation in a way similar to [6]. After an initial segmentation performed by a technique developed for tonal region images, we define an MRF on the sites comprising the initial oversegmented regions. Such oversegmented regions are expected to be merged, resulting in an optimal segmentation through an energy minimization process associated with the underlying MRF. To consider evidences from different sensors, DS fusion is carried out pixel by pixel and is incorporated in the Markovian context while obtaining the optimal segmentation by the energy minimization scheme associated with the MRF. To incorporate the DS fusion we associate a binary variable with the energy function. This binary variable takes values depending upon some characteristics of the DS labelling of the pixels of two adjacent regions in the clique potential function. If a specific DS label is found to be common to the majority of the pixels in each of these two adjacent regions, then the binary variable takes the value one; otherwise it is zero. Through this binary variable in the energy function, the feature-level DS fusion is mixed into the data-level fusion process for obtaining the optimal segmentation. The originality of the paper lies in underlining how the features of DS theory may be exploited in the MRF based segmentation approach to classification of natural scenes without much intensive computation. The paper is organized as follows. Section 2 describes an evidential approach for multisource data analysis with the derivation of the mass functions that are used in DS fusion. Section 3 describes the MRF model based segmentation scheme. Section 4 discusses the experimental results and concludes the paper.
2 Evidential Approach for Multisource Data Analysis
We consider N separate data sensors (sources), each providing a measurement for the pixel of interest s, s = 1, ..., R × C, where R and C are the number of rows and columns of the image; the measurement from a multidimensional source is a vector. Suppose there are K classes (true states of nature) into which the pixels are to be classified according to a per-pixel approach. The classification method involves labelling the pixels as belonging to one of these classes. We consider here pixel-specific numerical data after appropriately co-aligning the pixels arising out of the different sensors. As is well known, the mathematical theory of evidence or Dempster-Shafer (DS) theory [8,5] is a field in which the contributions from separate sources of data, numerical or non-numerical, can be combined to provide a joint inference concerning
the labelling of the pixels. For the N sensors we thus have N mass functions m_1, ..., m_N. These functions have the following characteristics: (i) m_j(∅) = 0, where ∅ is the empty set, meaning thereby a null proposition; (ii) Σ_{A ⊆ Θ} m_j(A) = 1, where Θ is the set of propositions for pixel labelling and the summation runs over all possible propositions. In the DS theory of evidence, two more functions are derived from this mass function, viz., plausibility (Pls) and belief (Bel) (see [8]). The question now is how we can bring the evidences from each of the sources together to get a joint recommendation on the pixel's label with some confidence, by increasing the amount of global information while decreasing its imprecision and uncertainty. The rule for aggregating evidences from different sources is called Dempster's orthogonal sum or rule of combination [8] and is given by the combined mass function as follows:

M(A) = (m_1 ⊕ ... ⊕ m_N)(A) = ( Σ_{A_1 ∩ ... ∩ A_N = A} Π_{j=1}^{N} m_j(A_j) ) / ( 1 − Σ_{A_1 ∩ ... ∩ A_N = ∅} Π_{j=1}^{N} m_j(A_j) ).
In remote sensing landcover classification the union of two or more labelling propositions is of little interest, as the classes considered are usually distinct; further, its mass is determined substantially by the masses of each of the simple propositions concerned. If our labelling propositions are restricted to the singleton classes, then the three functions, viz., M, Bel and Pls, are equal and give the same decision for labelling, hence any of them may be adopted.

Mass Function Derivation. Let us consider a set of K classes into which pixels are to be classified from N different sensor image data. For the jth sensor, the ith class conditional density is determined with the help of the ground truth samples. With equal prior probabilities these class conditional density functions become posterior probabilities. Now for each pixel and each sensor we calculate these posteriors and assign to the sth pixel the label with the largest posterior. At this stage each of the pixels in the different sensor images will have a label. If all sensors assign the same label to the sth pixel, the sth pixel of the DS output image is labelled with that class. The remaining pixels, which have been assigned contradictory labels by different sensors, are kept unlabelled in the DS output image at this stage. For unlabelled pixels, we consider the power set of the contradictory labels and determine normalized mass functions for each of the sensors over this set, as described below. Such unlabelled pixels will be labelled by the maximum belief rule or the maximum plausibility rule for singleton classes using the combined mass functions. Thus, for each sensor we calculate the mass functions from the class conditional densities evaluated at the intensity value of the sth pixel for the jth sensor, normalized over the set of admissible propositions; these mass functions are the probabilities on this set.
While computing probabilities of the combined hypotheses we use the simple elementary rule of probability of the union of events for singleton classes. In this way, a singleton class with a high probability reflects its dominance in the aggregate evidence, and hence a pixel is more likely to be labelled with such a class, which is possibly a better reflection of the true scene. Thus, the unlabelled sth pixel in the DS-labelled output image is now labelled with the maximum belief rule, that is, labelled with the class that maximizes the combined mass function given by the rule of combination above. In the next section we outline the procedure to determine the optimal segmentation based on the maximum a posteriori probability (MAP) estimate.
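A minimal sketch of the per-pixel combination step for the case where every focal element is a singleton class, assuming NumPy and already-normalized per-sensor masses; conflicting evidence is handled by the normalizing denominator of the orthogonal sum.

```python
import numpy as np

def combine_singleton_masses(mass_per_sensor):
    """Dempster's orthogonal sum when every focal element is a singleton class.

    mass_per_sensor: array of shape (N_sensors, K_classes); each row is one
    sensor's mass function over the K singleton propositions (rows sum to 1).
    Returns the combined masses and the amount of conflict normalized away.
    """
    prod = np.prod(mass_per_sensor, axis=0)   # agreement on each class
    conflict = 1.0 - prod.sum()               # mass that would go to the empty set
    combined = prod / prod.sum()
    return combined, conflict

# Usage: two sensors disagree on a pixel's label.
m = np.array([[0.7, 0.2, 0.1],
              [0.3, 0.6, 0.1]])
M, k = combine_singleton_masses(m)
label = int(np.argmax(M))                     # maximum belief rule
```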
3 MRF Model Based Segmentation Scheme
We follow the scheme of Sarkar et al. [6] in defining the MRF on a region adjacency graph (RAG) of initial oversegmented regions; the details are omitted here. Our discussion is directed at formulating the energy function that incorporates the features of DS theory in the MRF based segmentation approach. Minimizing this energy function results in a MAP estimate of the optimally segmented image. After carrying out an initial segmentation following the approach of [6] on each of the selected channels of all the different sensors (say N in number), the resulting segments are intersected among each other to give rise to a set of new segments comprising a merged initial segmented image, which is then passed as input to the MRF model. Since the sensor images are co-aligned pixel by pixel and the intensity values are all numerical, we may consider all the sensor data together as if they were from a single source having multiple channels. It is assumed that the merged initially segmented image has Q regions and that there is a set of labels, each a discrete value corresponding to a spectral class of the image. The objective we adopt is to assign the region labels satisfying the constraints of an optimal segmentation for multichannel, multisensor imagery. We impose two constraints as per our notion of optimal segmentation from multisensor image data: (i) an optimally segmented image region should be uniform with respect to the measured characteristics obtained from all the sensors; (ii) two distinct adjacent regions should be as dissimilar as possible with respect to the measured characteristics, as evident from the combined evidence from all the sensors. As per the merged initial segmented image, the same regions are grown in each of the channels of the different sensors. Thus, the multichannel image is initially segmented into a set of Q disjoint regions. Representing each region as a node with multichannel information, a RAG G = (V, E) is defined, where V is the set of nodes and E is a set of edges
connecting them. With an appropriate neighborhood system an MRF is defined (see details in [6]). The posterior probability distribution is given by P(X = x | Y = y) ∝ exp(−U(x | y)), where U(x | y) is the energy function, and the events {X = x} and {Y = y} represent, respectively, a specific labelling configuration and a specific realization of the observations. Since the energy function is a sum of clique potentials, it is necessary to select appropriate cliques and clique potential functions to achieve the desired objective. For the cliques and clique potential functions, only the set of adjacent two-region pairs, each of which is directly connected in the RAG, is considered here. The two components of the energy function corresponding to the two constraints are denoted as the region process (H) and the edge process (B), respectively. Let m_1, ..., m_Q represent the mean intensity vectors of the initially segmented regions, where each m_i is a (P × 1) vector and n_i is the number of pixels in region R_i, and let S_1, ..., S_Q represent the scatter matrices; that is, S_i is a (P × P) matrix with elements of sums of squared deviations from the mean and sums of cross-product deviations.

Region process (H): A measure of the uniformity of region R_i with respect to its intensity values is given by the elements of its scatter matrix S_i, or equivalently by the generalized covariance (see [6]) of the region. With the above measure of uniformity, the evidences from different sensors may also be combined with the following scheme, which takes into account the pattern of the DS labels of the pixels of two regions R_i and R_j belonging to a clique and thus examines whether the majority of the pixels of each of these two regions have the same class label. If R_i and R_j are regions belonging to a clique, then corresponding to the first constraint a clique potential function [6] can be defined in terms of the within-region scatter, multiplied by a binary variable taking values 0 and 1. It takes the value 1 when the following two conditions are together satisfied: the first condition is on the pattern of DS labels for the regions R_i and R_j, as mentioned above; the second condition is that the regions are homogeneous with respect to the multisensor pixel intensity values. If any of the above two conditions is violated, the binary variable takes the value 0, which indicates that the regions should not be merged. With this variable the feature-level fusion is coupled with the data-level fusion in the energy minimization process. However, with only the above definition, the dissimilarity between adjacent regions is not taken into account and the formulation of the energy function is not complete; therefore, an edge process is introduced through the second constraint as given below.

Edge process (B): We note that merging the two distinct regions R_i and R_j results in a new scatter matrix of the merged region given by

S_{i∪j} = S_i + S_j + (n_i n_j / (n_i + n_j)) (m_i − m_j)(m_i − m_j)^T.

The third term is also a P × P matrix whose elements exhibit a measure of the dissimilarity existing between the regions R_i and R_j. Incorporating the edge process, we re-define the clique potential function as a weighted combination of the region and edge terms.
The parameter controls the weight to be given to the two processes for the regions involved in the clique. A suitable comparative criterion among the elements of the matrices associated with the region process and the edge process is necessary for deciding on the merging of two adjacent regions. Since the ratio of these quantities can be put in a convenient form, the comparative criterion needed here is based on Hotelling's statistic. Therefore, given that the Dempster-Shafer labelling is the same (according to the region labelling scheme followed) for the two regions in the clique, the regions should be merged if the statistic does not exceed the corresponding threshold, and should not be merged otherwise [as in [6, p. 1106]]. It is also to be noted here that the minimization of the energy function has been investigated by first identifying the node having the maximum aggregate clique potential with its neighbours. The segmented image is obtained by minimizing the energy function as described above. The flowchart of the methodology is depicted in Fig. 1. Cluster Validation Scheme: In order to validate the segments of the optimal segmented image, we follow the first stage of the cluster validation scheme of Sarkar et al. [6] and label the unlabelled segments using Fisher's method for discriminating among the K ground-truth classes.
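To make the merge decision concrete, the following Python sketch illustrates a Hotelling-style comparison between two adjacent regions, using only their pixel counts, mean vectors and scatter matrices as introduced above. It is a minimal illustration under standard two-sample assumptions; the function names, the significance level, and the exact form of the threshold used in [6] are not taken from the paper.

import numpy as np
from scipy import stats

def region_stats(pixels):
    # pixels: (n, P) array of multichannel intensities of one region.
    # Returns the sample count, the mean vector, and the scatter matrix
    # (sums of squared deviations and cross-products about the mean).
    n = pixels.shape[0]
    mean = pixels.mean(axis=0)
    dev = pixels - mean
    return n, mean, dev.T @ dev

def hotelling_merge_test(stats_i, stats_j, alpha=0.05):
    # Decide whether two adjacent regions are similar enough to merge,
    # using a two-sample Hotelling T^2 test on their mean vectors.
    n_i, m_i, W_i = stats_i
    n_j, m_j, W_j = stats_j
    P = m_i.size
    pooled = (W_i + W_j) / (n_i + n_j - 2)          # pooled covariance
    diff = m_i - m_j
    t2 = (n_i * n_j / (n_i + n_j)) * diff @ np.linalg.solve(pooled, diff)
    # Convert T^2 to an F statistic and compare with the critical value.
    f_stat = (n_i + n_j - P - 1) / (P * (n_i + n_j - 2)) * t2
    f_crit = stats.f.ppf(1 - alpha, P, n_i + n_j - P - 1)
    return f_stat <= f_crit   # True: means not significantly different, merge

In the segmentation loop, a clique whose two regions carry the same DS label and pass this test would be merged; otherwise the edge process keeps them separate.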
4 Experimental Results and Conclusion
The proposed methodology of MRF-based segmentation for classification of multisensor data in the context of DS theory has been applied to two subscenes. Both consist of one optical sensor image with four channels and a SAR image of the same site. The first subscene is of size 442×649 and the second of size 707×908. The date of acquisition was 19 January 2000 for the optical image and 30 September 1999 for the SAR image, a time lag of 110 days. There are 12 and 16 different land cover classes involved in the first and the second subscenes, respectively. The total available ground-truth samples, which amount to about 5.5% of the total number of pixels of the first subscene and 3.5% of the second, are divided into two subsamples. For both subscenes, the first subsample is used for labelling some of the clusters as in [6]; subsequently, the remaining clusters are labelled with the help of these labelled clusters using Fisher's discriminant scores. The second subsample of each subscene is used for the quantitative evaluation of the classification accuracy after all clusters are validated.
Fig. 1. Multisensor Image Segmentation Scheme
The measurements from the different sensors are assumed to be conditionally independent [7]. The probability density function (pdf) of the SAR intensity distribution, after it is made speckle-free, has been considered to be Gaussian. The pdf of an optical image with 4 channels (bands) is considered to be multivariate Gaussian. We investigate the following approaches. Case (i) (Proposed Method): Initial segmentation is first performed on each sensor's selected channel. These initial segmented images, one on channel 2 of the optical image and the other on the SAR image, are then merged as described in Section 3. The aggregate evidences of the different sensors, as obtained with eqn (3), are then incorporated into data-level fusion in image space (spatial context) through the energy minimization process. Finally, a cluster validation scheme is applied to this segmented image. For the sake of comparison, we have investigated the approach of Tupin et al. [9] as Case (ii), and two other nonparametric multisensor fusion methods as Case (iii) and Case (iv), respectively. Case (ii): In this approach [9] a direct classification is done on the initial segmented regions of the RAG using the DS rule. Unlike the proposed methodology, where a separate set of labels is used for labelling the RAG, the regions are labelled here from the set of thematic (ground-truth) classes. Case (iii): Multilayer Perceptron, and Case (iv): Radial Basis Functions [4]. A comparison of the classification accuracies of the proposed methodology along with Cases (ii) through (iv) for both subscenes is presented in Table I. This table provides normalized classification accuracies and time durations on a
Pentium IV system with 1.86 GHz and 512 MB RAM. The marked entries in Table I indicate the methods over which Case (i) is significantly better (by Kappa coefficients). The test results show that the proposed method has an edge over the other methods.
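For readers unfamiliar with the evidence fusion underlying Cases (i) and (ii), the short Python sketch below applies Dempster's rule of combination to two basic probability assignments, one per sensor. The class names and mass values are purely illustrative and are not taken from the experiments.

import numpy as np

def dempster_combine(m1, m2):
    # Combine two basic probability assignments (BPAs) over the same frame
    # of discernment with Dempster's rule.  Each BPA is a dict mapping a
    # frozenset of class labels to a mass in [0, 1].
    combined = {}
    conflict = 0.0
    for a, ma in m1.items():
        for b, mb in m2.items():
            inter = a & b
            if inter:
                combined[inter] = combined.get(inter, 0.0) + ma * mb
            else:
                conflict += ma * mb
    if conflict >= 1.0:
        raise ValueError("total conflict: the two sources are incompatible")
    return {k: v / (1.0 - conflict) for k, v in combined.items()}

# Toy example with two classes {w1, w2}: one source stands for the optical
# sensor and one for the SAR sensor; the masses are illustrative only.
m_opt = {frozenset({"w1"}): 0.6, frozenset({"w1", "w2"}): 0.4}
m_sar = {frozenset({"w1"}): 0.3, frozenset({"w2"}): 0.3,
         frozenset({"w1", "w2"}): 0.4}
print(dempster_combine(m_opt, m_sar))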
Acknowledgment. This work was supported by the ISRO Grant, Ref: 10/4/416, dated 27 Feb. 2003.
References
1. M.A. Abidi and R.C. Gonzalez, "Data Fusion in Robotics and Machine Intelligence". New York: Academic, 1992.
2. A. Bendjebbour, Y. Delignon, L. Fouque, V. Samson and W. Pieczynski, "Multisensor Image Segmentation Using Dempster-Shafer Fusion in Markov Fields Context", IEEE Trans. Geosci. Remote Sensing, vol. 39, no. 8, pp. 1789-1798, Aug. 2001.
3. S. Le Hegarat-Mascle, I. Bloch and D. Vidal-Madjar, "Application of Dempster-Shafer Evidence Theory to Unsupervised Classification in Multisource Remote Sensing", IEEE Trans. Geosci. Remote Sensing, vol. 35, pp. 1018-1031, July 1997.
4. Y.S. Hwang and S.Y. Bang, "An efficient method to construct a radial basis function neural network classifier", Neural Networks, vol. 10, pp. 1495-1503, August 1997.
5. T. Lee, J.A. Richards and P.H. Swain, "Probabilistic and evidential approaches for multisource data analysis", IEEE Trans. Geosci. Remote Sensing, vol. GRS-25, pp. 283-293, May 1987.
6. A. Sarkar, M.K. Biswas, B. Kartikeyan, V. Kumar, K.L. Majundar and D.K. Pal, "A MRF Model Based Segmentation Approach to Classification for Multispectral Imagery", IEEE Trans. Geosci. Remote Sensing, vol. 40, pp. 1102-1113, May 2002.
7. A.H. Schistad Solberg, T. Taxt and A.K. Jain, "A Markov random field model for classification of multisource satellite imagery", IEEE Trans. Geosci. Remote Sensing, vol. 34, pp. 100-113, Jan. 1996.
8. G. Shafer, "A Mathematical Theory of Evidence". Princeton, NJ: Princeton University Press, 1976.
9. F. Tupin, I. Bloch and H. Maitre, "A first step towards automatic interpretation of SAR images using evidential fusion of several structure detectors", IEEE Trans. Geosci. Remote Sensing, vol. 37, pp. 1327-1343, Mar. 1999.
Regularized RBF Networks for Hyperspectral Data Classification
G. Camps-Valls1, A.J. Serrano-López1, L. Gómez-Chova1, J.D. Martín-Guerrero1, J. Calpe-Maravilla1, and J. Moreno2
1 Grup de Processament Digital de Senyals, Universitat de València, Spain. [email protected], http://gpds.uv.es/
2 Departament de Termodinàmica, Universitat de València, Spain.
Abstract. In this paper, we analyze several regularized types of Radial Basis Function (RBF) networks for crop classification using hyperspectral images. We compare the regularized RBF neural network with Support Vector Machines (SVM) using the RBF kernel, and with the AdaBoost Regularized (ABR) algorithm using RBF bases, in terms of accuracy and robustness. Several scenarios of increasing input space dimensionality are tested on six images containing six crop classes. Attention is also paid to regularization, sparseness, and knowledge extraction. Several conclusions are drawn: (1) all models offer similar accuracy, but SVM and ABR yield slightly better results than RBFNN; (2) the results indicate that ABR is less affected by the curse of dimensionality and has efficiently identified the presence of noisy bands; (3) we find that regularization is a useful method for working with noisy data distributions; and (4) some physical consequences are extracted from the trained models. Finally, this preliminary work leads us to think of kernel-based machines as efficient and robust methods for hyperspectral data classification.
1 Introduction
The information contained in hyperspectral data about the chemical properties of the surface allows the characterization, identification, and classification of surface features by means of the recognition of unique spectral signatures, with improved accuracy and robustness. Pattern recognition methods have proven to be effective techniques in applications of this kind [1]. In recent years, many supervised methods have been developed to tackle the problem of automatic hyperspectral data classification. A successful approach is based on the use of neural networks, both multilayer perceptrons (MLP) [2] and Radial Basis Function Neural Networks (RBFNN) [3]. The latter have shown excellent robustness and accuracy, given the Gaussian nature of much multi- and hyperspectral data. Intimately related to RBFNN, the use of Support Vector Machines (SVM) has recently shown excellent results [4,5]. SVMs can handle large input spaces, which is especially convenient when working with hyperspectral data; can effectively avoid overfitting by controlling the margin; and can automatically identify a small subset made up of informative pixels in the image, namely the support vectors (SV) [6]. Lately, the use of combined experts has opened
a wide field in the pattern recognition community. In this context, a promising boosting algorithm is AdaBoost and its regularized version (AdaBoost Regularized, ABR) [7], which is intimately related to SVM [8] and other kernel methods [9]. The aim of this communication is to benchmark the aforementioned regularized RBF-based kernel methods on hyperspectral data classification. The communication is outlined as follows. In Section 2, the classification methods used are described. The classification results are presented in Section 3. Some conclusions and a proposal for future work end this paper in Section 4.
2 Regularized RBF Networks in Feature Spaces
In a two-class problem, working in feature spaces implies mapping a labeled training data set to a higher dimensional space by means of a nonlinear mapping. For a given problem, one then considers the learning algorithm working in the feature space instead of the input space. This is implicitly done by one-hidden-layer neural networks or boosting methods, where the input data are mapped to some representation given by the hidden layer or the hypothesis space, respectively. In order to control the capacity of the models and avoid overfitting, the solution is usually regularized. In this context, we analyze three implementations of regularized RBF networks working in feature spaces.
2.1 Radial Basis Function Neural Network (RBFNN)
In a Radial Basis Function neural network (RBFNN), the sigmoid-shaped activation function of an MLP is replaced by a Gaussian function. The output of the network is computed as a linear combination of the hidden-layer Gaussian activations, weighted by the output-layer weights, where the Gaussian functions are parameterized by their means and variances. The learning rule used to update the weight and variance vectors can be easily derived using the delta rule. In practice, RBFNNs usually incorporate a regularization term into the functional to be minimized:
which trades off the minimization of the empirical error against the norm of the weights to produce smoother solutions.
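A minimal Python sketch of such a regularized RBF output layer is given below. It assumes fixed, already chosen centers and widths and solves the ridge-regularized least-squares problem for the output weights in closed form, whereas the network described above adapts weights and variances with the delta rule; all identifiers (rbf_design_matrix, lam, etc.) are illustrative.

import numpy as np

def rbf_design_matrix(X, centers, widths):
    # Gaussian hidden-layer activations:
    # Phi[i, j] = exp(-||x_i - mu_j||^2 / (2 * s_j^2)).
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * widths ** 2))

def fit_rbfnn_weights(X, y, centers, widths, lam=1e-2):
    # Ridge-regularized least squares for the output weights:
    # minimize ||y - Phi w||^2 + lam * ||w||^2.
    Phi = rbf_design_matrix(X, centers, widths)
    A = Phi.T @ Phi + lam * np.eye(Phi.shape[1])
    return np.linalg.solve(A, Phi.T @ y)

def rbfnn_predict(X, centers, widths, w):
    # Network output: linear combination of the Gaussian activations.
    return rbf_design_matrix(X, centers, widths) @ w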
2.2 Support Vector Machines (SVM)
Following the previous notation, the SVM method minimizes the regularized functional
constrained to:
where w and the bias term define a linear classifier in the feature space. The non-linear mapping is performed in accordance with Cover's theorem, which guarantees that the transformed samples are more likely to be linearly separable in the resulting feature space. The regularization parameter C controls the generalization capabilities of the classifier and can be selected by the user, and positive slack variables allow dealing with permitted errors [6]. Due to the high dimensionality of the vector variable w, the primal functional (2) is usually solved through its Lagrangian dual problem, taking advantage of the "kernel trick". The basic idea of this method is that data appear in the training algorithm only in the form of dot products. Therefore, if the data are previously mapped to some other Euclidean space, they appear again in the form of dot products in that space. Consequently, one does not need to know the mapping explicitly, but only the kernel function K(·,·). In this work, we have used the Gaussian kernel. The main advantage of using the SVM with the RBF kernel with respect to the RBFNN is that the centers of the Gaussians are tuned automatically [10].
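As an illustration of the kernel trick in practice, the following sketch trains an SVM with a Gaussian kernel on synthetic two-class data using scikit-learn; the values of C and gamma are placeholders, whereas in the experiments reported below the free parameters are selected by cross-validation.

import numpy as np
from sklearn.svm import SVC

# Synthetic two-class data standing in for pixel spectra (n_samples x n_bands).
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 6))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

# Gaussian (RBF) kernel K(x, z) = exp(-gamma * ||x - z||^2), with gamma
# playing the role of 1 / (2 sigma^2); C and gamma below are illustrative.
clf = SVC(kernel="rbf", C=10.0, gamma=0.5)
clf.fit(X, y)
print("number of support vectors:", clf.support_vectors_.shape[0])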
2.3 AdaBoost Regularized (ABR)
The AdaBoost algorithm, introduced in [11], takes as input a labeled training set and calls a weak or base learning algorithm iteratively. In this paper, we use RBFs as base learners. In each iteration, a confidence weight is assigned to and updated for each training sample. On each iteration, the weights of incorrectly classified samples are increased so that the weak learner is forced to focus on the hard patterns in the training set. The task of the base learner reduces to finding a hypothesis appropriate for the current distribution over the training samples. The goodness of a weak hypothesis is measured by its weighted error. Once the weak hypothesis has been calculated, AdaBoost chooses a parameter that measures the importance assigned to it; this parameter is positive when the error is below one half, and it grows larger as the error gets smaller. The distribution is next updated in order to increase the weight of the samples misclassified by the current hypothesis and to decrease the weight of the correctly classified patterns [11]. Thus, the weight tends to concentrate on difficult samples, which is somewhat reminiscent of support vectors. The final hypothesis is a weighted majority vote of the T weak hypotheses. Consequently, for each instance the final hypothesis yields a prediction whose sign is the predicted label (–1 or +1) and whose magnitude gives a measure of confidence in the prediction. SVMs and AdaBoosting are explicitly related by observing that any hypothesis set implies a mapping, and therefore also a kernel. In fact, any hypothesis set H spans a
feature space which is obtained by some mapping, and the corresponding hypothesis set can be constructed from it [9]. Therefore, AdaBoosting can be expressed as the maximization of the smallest margin with respect to w, subject to suitable constraints.
At this point, note the relationship among expressions (1), (2), and (5). The AdaBoost algorithm can be regularized, leading to the AdaBoost Regularized (ABR) algorithm [7], on which we focus in this paper.
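The following Python sketch shows the weight-update mechanism of plain (unregularized) AdaBoost for labels in {-1, +1}; it uses depth-1 decision trees in place of the RBF base learners and omits the regularization term that distinguishes ABR, so it is only meant to illustrate how the sample distribution concentrates on hard patterns.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_train(X, y, T=10):
    # Plain AdaBoost for numpy labels y in {-1, +1}.
    n = len(y)
    D = np.full(n, 1.0 / n)                      # sample distribution D_t
    learners, alphas = [], []
    for _ in range(T):
        h = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=D)
        pred = h.predict(X)
        err = D[pred != y].sum()                 # weighted training error
        if err == 0.0 or err >= 0.5:
            break
        alpha = 0.5 * np.log((1.0 - err) / err)  # importance of this round
        D *= np.exp(-alpha * y * pred)           # up-weight the mistakes
        D /= D.sum()
        learners.append(h)
        alphas.append(alpha)
    return learners, alphas

def adaboost_predict(X, learners, alphas):
    # Sign of the weighted majority vote; its magnitude is a confidence.
    agg = sum(a * h.predict(X) for h, a in zip(learners, alphas))
    return np.sign(agg)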
3 Results
3.1 Data Collection and Setup
We have used six hyperspectral images (700×670 pixels) acquired with the 128-band HyMap spectrometer during the DAISEX-1999 campaign, under the Scientific Analysis of the European Space Agency (ESA) Airborne Multi-Annual Imaging Spectrometer Campaign (more details at http://io.uv.es/projects/daisex/). After data acquisition, a preliminary test was carried out to measure the quality of the data. No significant signs of coherent noise were found. A high level of incoherent noise was found at several bands for DAISEX-99 (Fig. 1); in fact, bands 1 and 128 were no longer available in the DAISEX-2000 campaign. Bands 2, 66, 67, and 97 were also considered noisy bands due to their high variability.
Fig. 1. Incoherent noise in a HyMap image, observed in one band for the alfalfa crop.
This issue constitutes an a priori difficulty for classifiers that take into account all available bands. In [12], we selected four relevant subsets of representative bands (containing 128, 6, 3, and 2 bands) by means of classification trees. In this paper, we evaluate the performance of the methods in these four scenarios. For classification purposes, six different classes were considered in the area (corn, sugar beet, barley, wheat, alfalfa, and soil), which were labeled from #1 to #6, respectively. Training and validation sets were formed by 150 samples/class, and models were selected using the cross-validation method. Finally, a test set consisting of the true map of the scene over the complete images was used as the final performance indicator, which constitutes an excellent confidence margin for the least measured error.
3.2 Model Development
All simulations were performed in MATLAB®. In the case of the RBFNN, the number of Gaussian neurons was tuned between 2 and 50, and the regularization parameter was varied exponentially over a wide range. In order to develop an SVM, we tried exponentially increasing sequences of the penalty parameter C and the Gaussian kernel width.
For the ABR algorithm, the regularization term was varied over a range of values, and the number of iterations was tuned to T = 10. The widths and centers of the Gaussians are computed iteratively within the algorithm.
3.3 Model Comparison
Accuracy and robustness. Table 1 shows the average recognition rate (ARR%) for RBFNN, SVM, and ABR on the training, validation, and test sets. The ARR% is calculated as the rate of correctly classified samples over the total number of samples, averaged over the six available images. Some conclusions can be drawn from Table 1. All models offer, in general, similar recognition rates. SVM and ABR yield better results than RBFNN on the test set for low-dimensional input spaces, but no numerical differences are observed for the 128-band set. In fact, RBFNN and ABR produce very similar results, which could be explained by the fact that both methods adjust weights and variances. We can conclude that all methods have efficiently identified the presence of noisy bands in the 128-band training dataset (see Section 3.1), all of them reporting good results. Users and producers. Table 2 shows the confusion matrix of an image provided by the best classifier (ABR, 128 bands). We also include the user's accuracy and the producer's accuracy for each class. User's accuracy (UA[%]) is the proportion of correctly classified samples of a desired class over the total samples in that desired class. Producer's accuracy (PA[%]) is the proportion of correctly classified samples of a predicted class over the total samples in that predicted class. In general, high rates of user's accuracy are observed (UA>95%). However, producer's accuracies are lower (PA>84%), which is especially significant for sugar beet and corn. This was due to the fact that the sugar beet was in an early stage of phenology and showed
small cover, the soil was rather heterogeneous, and the corn was in an early stage of maturity, which biases the models towards misclassifying bare soil (class #6) as corn (class #1). This problem was also observed for SVM [5]. By using ABR, the user's accuracy for soil increased by 2.5% and the producer's accuracy for corn increased by 4.5%. This could be explained by the fact that ABR concentrates its resources on patterns that are difficult to classify, and the solution is controlled with the regularization parameter. Figure 2 shows the original image and the classified samples using the best approach (ABR, 128 bands) for one of the collected images. Corn classification seems to be the most troublesome, which could be due to the presence of a whole field of two-leaf corn in the early stage of maturity, where soil was predominant and was not accounted for in the reference labeled image. The confusion matrix supports this conclusion, as most of the errors are committed against the bare soil class.
Fig. 2. Left: RGB composite of the red, green and blue channels from the 128-band HyMap image taken in June, 1999. Right: Map of the whole image classified with the labels of the classes of interest.
Effect of regularization. Regularization is a very useful technique to obtain smoother solutions in the presence of outliers and difficult samples in the data distribution. For illustration purposes, Fig. 3 shows the solution provided by the best models for the problem of corn-barley discrimination using 2-input-band classifiers. We selected this problem because of the high inter-class overlap.
Fig. 3. Discrimination of the barley-corn classes for the best 2-band classifiers. The training patterns for the two classes are shown with distinct markers (one of them '+'). Decision lines for the best (a) RBFNN, (b) SVM, and (c) ABR versus bands 17 (x-axis) and 22 (y-axis) of the 2-band classifiers.
RBFNN produces an overly complex decision boundary. In fact, good results are obtained at the expense of utilizing many more hidden neurons in the RBFNN than RBF nodes in the ABR. SVM offers a smoother solution, but some samples (high values of the reflectance bands) are modeled by means of isolated local boundaries. On the other hand, ABR produces a rather simple decision function, but several samples need extremely local boundaries to be correctly modeled. A nice property of boosting is its ability to identify outliers in the data distribution, i.e. mislabeled, ambiguous, or hard-to-classify samples. This, however, can make the model over-concentrate on the most difficult examples, which must be controlled by incorporating a regularization term that ensures smoothness of the solution. We can conclude that the additional flexibility of RBFNN or ABR must be controlled very carefully with their corresponding regularization parameters in order to avoid overfitting or over-smoothed solutions. Model complexity and sparsity. The best RBFNN and ABR classifiers were formed by sixteen and five hidden nodes, respectively (Table 1). The best SVM classifier (6 bands) was formed by 78 support vectors (SVs), namely 8.67% of the whole training data set, which indicates that a very reduced subset of examples is sufficient to attain significant results. SVMs and ABR work in very high-dimensional feature spaces and both lead to sparse solutions, although in different spaces. In fact, boosting can be thought of as an SVM approach in a high-dimensional feature space spanned by the base hypotheses of some function set H (Eq. (5)), and it effectively uses a regularizer which induces sparsity. Conversely, one can think of SVM as a "boosting approach" in a high-dimensional space in which, thanks to the "kernel trick", we never work explicitly in the feature space.
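For completeness, the per-class accuracies reported in Table 2 can be computed from a confusion matrix as sketched below in Python; the orientation of the matrix (reference classes along rows, predicted classes along columns) is an assumption, since conventions vary.

import numpy as np

def user_producer_accuracy(cm):
    # cm[i, j] = number of samples whose reference (desired) class is i and
    # whose predicted class is j (this orientation is an assumption).
    diag = np.diag(cm).astype(float)
    ua = 100.0 * diag / cm.sum(axis=1)   # per desired (reference) class
    pa = 100.0 * diag / cm.sum(axis=0)   # per predicted class
    return ua, pa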
4 Conclusions
In this communication, we have compared the use of regularized RBF-based methods for hyperspectral data classification. We have benchmarked RBFNN, SVM with RBF kernels, and AdaBoost Regularized in terms of accuracy and
robustness. The issues of robustness to outliers and to noisy data have been addressed by all methods. Future work will consider other kernel-based methods, such as the kernel Fisher discriminant. Acknowledgments. The authors want to express their gratitude to Prof. Lorenzo Bruzzone from the Università degli Studi di Trento (Italy) for his useful comments on this paper.
References
[1] Swain, P.: Fundamentals of pattern recognition in remote sensing. In: Remote Sensing: The Quantitative Approach. McGraw-Hill, New York, NY (1978) 136–188
[2] Bischof, H., Leona, A.: Finding optimal neural networks for land use classification. IEEE Transactions on Geoscience and Remote Sensing 36 (1998) 337–341
[3] Bruzzone, L., Fernandez-Prieto, D.: A technique for the selection of kernel-function parameters in RBF neural networks for classification of remote-sensing images. IEEE Transactions on Geoscience and Remote Sensing 37 (1999) 1179–1184
[4] Huang, C., Davis, L.S., Townshend, J.R.G.: An assessment of support vector machines for land cover classification. International Journal of Remote Sensing 23 (2002) 725–749
[5] Camps-Valls, G., Gómez-Chova, L., Calpe, J., Soria, E., Martín, J.D., Alonso, L., Moreno, J.: Robust support vector method for hyperspectral data classification and knowledge discovery. IEEE Transactions on Geoscience and Remote Sensing 42 (2004) 1–13
[6] Schölkopf, B., Smola, A.: Learning with Kernels – Support Vector Machines, Regularization, Optimization and Beyond. MIT Press (2001)
[7] Rätsch, G., Schölkopf, B., Smola, A., Mika, S., Onoda, T., Müller, K.R.: Robust ensemble learning. In Smola, A., Bartlett, P., Schölkopf, B., Schuurmans, D., eds.: Advances in Large Margin Classifiers. MIT Press, Cambridge, MA (1999) 207–219
[8] Rätsch, G., Mika, S., Schölkopf, B., Müller, K.R.: Constructing boosting algorithms from SVMs: an application to one-class classification. IEEE PAMI (2002) In press. An earlier version is GMD Technical Report No. 119, 2000.
[9] Müller, K.R., Mika, S., Rätsch, G., Tsuda, K.: An introduction to kernel-based learning algorithms. IEEE Transactions on Neural Networks 12 (2001) 181–201
[10] Schölkopf, B., Sung, K., Burges, C., Girosi, F., Niyogi, P., Poggio, T., Vapnik, V.: Comparing support vector machines with Gaussian kernels to radial basis function classifiers. IEEE Trans. Signal Processing 45 (1997) 2758–2765. AI Memo No. 1599, MIT, Cambridge.
[11] Schapire, R.: The strength of weak learnability. Machine Learning 5 (1990) 197–227
[12] Gómez-Chova, L., Calpe, J., Soria, E., Camps-Valls, G., Martín, J.D., Moreno, J.: CART-based feature selection of hyperspectral images for crop cover classification. In: IEEE International Conference on Image Processing, Barcelona, Spain (2003)
A Change-Detection Algorithm Enabling Intelligent Background Maintenance
Luigi Di Stefano1,2, Stefano Mattoccia1,2, and Martino Mola1,2
1 Department of Electronics Computer Science and Systems (DEIS), University of Bologna, Viale Risorgimento 2, 40136 Bologna, Italy
2 Advanced Research Center on Electronic Systems ‘Ercole De Castro’ (ARCES), University of Bologna, Via Toffano 2/2, 40135 Bologna, Italy
{ldistefano,smattoccia,mmola}@deis.unibo.it
Abstract. We have recently proposed a change-detection algorithm based on the idea of incorporating into the background model a set of simple low-level features capable of effectively capturing "structural" information. In this paper we show how this algorithm can naturally interact with the higher-level processing modules found in advanced video-based surveillance systems so as to allow for flexible and intelligent background maintenance.
1 Introduction
Advanced videosurveillance systems typically include, as the first image analysis step, a change-detection algorithm aimed at segmenting out the interesting regions from a background. Then, higher-level processing modules, such as tracking, classification and interpretation modules, process the output of the change-detection algorithm to attain the required degree of scene understanding. Most change-detection algorithms rely on the principle of background subtraction: a background model is compared to the current image in order to mark as foreground those pixels that exhibit a significant difference with respect to the corresponding background pixels. The main difficulty associated with change detection is not the background subtraction step, but rather the maintenance of a background model that correctly follows the changes of the reference scene. These can be grouped into illumination changes and changes due to objects. The latter occur when an object is introduced into or removed from the reference scene. If a foreground object stops, a decision should be taken on whether and when it is more appropriate to include its appearance into the background model. Similarly, if a background object starts moving, its absence in the previously occupied image region is detected as a bogus blob (usually known as a ghost). In this case, the elimination of the ghost is typically desirable and can be achieved by updating the background in the region previously occupied by the object. We have recently proposed a novel change-detection approach [1] that relies on a background model very robust with respect to illumination changes, so
that in principle it needs to be updated only to handle the insertion/removal of objects. In this paper we show how the changes due to objects can be dealt with effectively and intelligently by exploiting an interaction between the change-detection level and the higher processing levels found in advanced video-based surveillance systems.
2 Previous Work and Proposed Approach
Among the change-detection algorithms relying on the background subtraction principle, the statistical approach is the most widely adopted one [2,3,4,5,6]. With this approach, some features are used to represent the background pixels (e.g. RGB components, hue, luminance, gradient) and modelled by a probability distribution. A pixel of the current image is then classified as foreground if the observed features are not coherent with the associated probability distribution. Background maintenance typically consists in updating the parameters of the probability distributions based on the last observed images. As regards the background subtraction principle, the method proposed in [6] is the most similar to ours, since image gradient is used to achieve robustness with respect to illumination changes and the combination of gradient and colour information is done at region level. However, we extract gradient information at a reduced resolution to significantly improve robustness, and we explicitly exploit illumination insensitivity within the background maintenance process. The idea of exploiting an interaction between the change-detection level and higher-level modules can be found also in [5]. Yet, this method is much more complex, since it relies on combining, in a Mixture of Gaussians framework, colour information with the depth measurements provided by a stereo system. Excluding [5], in the above-mentioned algorithms a foreground object is immediately and gradually included into the background as soon as it stops moving, so that, after a certain time interval, it will be perceived as background. Similarly, a ghost will be absorbed into the background according to the same dynamics. It is worth pointing out that this common strategy allows the change-detection algorithm to recover from persistent false positive errors and that this is of fundamental importance for proper long-term functioning of the algorithm. However, this strategy relies on a user-selectable time constant that determines the time needed for a pixel repetitively classified as foreground to become part of the background. If the time constant is fast, false positives are absorbed very quickly into the background model, but slowly moving objects may corrupt the background. Conversely, if the time constant is slow, slowly moving objects are correctly detected but recovering from false positive errors takes a long time. Ideally, if the background model were based on features invariant to illumination changes, the above problem would disappear, since the background model would need to be updated only to accommodate the changes due to objects. Though it seems impossible to find low-level features invariant to every kind of illumination change, it is possible to devise low-level features that are robust with respect to many illumination changes occurring in practice. Starting from these
considerations, we have devised a set of very simple low-level features, referred to as image structure, that have proven to be very robust with respect to illumination variations. By including these features into the background model we have obtained a novel change-detection algorithm that satisfactorily approximates the ideal behaviour outlined previously. Since in principle our algorithm relies on a background model that needs to be updated only to handle "high-level" events, namely the insertion/removal of objects, it naturally holds the potential to interact with higher-level processing modules that may control the background maintenance process flexibly and intelligently. Thus, our algorithm has been designed to easily support this kind of interaction: it can accept a binary mask that controls the inclusion/removal of objects into the background model. The results reported in our previous paper [1] were obtained running the algorithm in "stand-alone" mode (i.e. without any feedback from higher-level modules). Here we discuss some new results that demonstrate how the algorithm can usefully interact with a simple higher-level module that tracks and classifies the blobs provided by the change-detection level.
3 The Change-Detection Algorithm
Given a grey-level image, I, the first step needed to extract the image structure consists in obtaining a reduced-resolution image, R[I], defined in terms of a scale factor. Then we obtain two additional images, which are simply the horizontal and vertical derivatives of the reduced-resolution image R[I]; this pair of derivative images forms the structure of the original grey-level image I. To obtain the structure in the case of colour images, we simply apply the described transformation to each of the three RGB channels. In this manner we obtain six images, two per colour channel. We have found that in real video sequences the structure variations produced by illumination changes are usually much smaller than those caused by true "structural" changes of the scene. However, a change-detection process based solely on structure would detect changes at a very low resolution, yielding inaccurate blobs. To overcome this problem, we adopt a background model made up of two separate parts: the first is the structure of the scene without moving objects, while the second is simply a colour image of the reference scene. The first part of the model will be referred to as the background structure.
Since its variations can be largely ascribed to objects, its updating is focused on handling this type of change, with a simple mechanism allowing feedback from a higher-level processing module. The second part of the model will be referred to as the background image; it provides the algorithm with the capability of detecting blobs at the highest possible resolution. Given the described background model, at each new frame the detection process operates at both the structure level and the image level; then, the detection results are combined adequately to obtain the final output. Before starting the detection process, our algorithm activates a simple "bootstrap" procedure aimed at estimating the initial data to be included into the background model. Structure-Level Detection. We compare the structure of the current frame, I, with the background structure by building up two delta-structure images associated respectively with the horizontal and vertical directions.
Then, choosing a suitable threshold value and recalling equation (2), we can observe that if the horizontal delta-structure exceeds the threshold at a given structure element, then a foreground object occupies (or a background object has left) the image region associated with that structure element or with its horizontal neighbour, or both. Similarly, if the vertical delta-structure exceeds the threshold at a structure element, the structure change could be located at that element or at its vertical neighbour, or both. Accordingly, we define a binary image containing the structure-level detection results.
Image-Level Detection. In this case the detection step is simpler and consists in computing the difference between I and the background image. Hence, comparing the colour channels of the current frame against the corresponding background channels with a new threshold value, we define a binary image containing the image-level detection results.
Combination of the Detection Results. The structure-level detection results are used to decide whether or not each of the blobs detected at the image level is valid. The validation is done by labelling the connected components of the image-level detection map and erasing those not intersecting at least one structure element marked as changed. The result of the combination step is a binary image, Mask, that contains only the blobs associated with objects changing the image structure.
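A compact Python sketch of the detection pipeline described in this section is given below. The block size, the thresholds, and the way the two derivative grids are aligned are illustrative choices, since the exact expressions are not reproduced here; the sketch only conveys the flow from structure extraction to blob validation.

import numpy as np
from scipy import ndimage

def image_structure(channel, n=8):
    # Reduced-resolution image R[I] (n x n block averages) and its
    # horizontal and vertical first differences ("structure").
    h, w = channel.shape
    h, w = h - h % n, w - w % n
    r = channel[:h, :w].reshape(h // n, n, w // n, n).mean(axis=(1, 3))
    return np.diff(r, axis=1), np.diff(r, axis=0)      # dx, dy

def change_mask(frame, bg_struct, bg_image, t_struct=8.0, t_img=25.0, n=8):
    # frame, bg_image: (H, W, 3) float arrays; bg_struct[c] = (dx, dy) per channel.
    # Image-level detection: per-pixel colour difference against the background.
    img_det = (np.abs(frame - bg_image) > t_img).any(axis=2)
    # Structure-level detection: any channel whose dx or dy differs markedly.
    s_det = None
    for c in range(3):
        dx, dy = image_structure(frame[:, :, c], n)
        bdx, bdy = bg_struct[c]
        d = (np.abs(dx - bdx)[:-1, :] > t_struct) | (np.abs(dy - bdy)[:, :-1] > t_struct)
        s_det = d if s_det is None else (s_det | d)
    # Validate image-level blobs: keep only connected components that
    # intersect at least one changed structure element.
    s_full = np.repeat(np.repeat(s_det, n, axis=0), n, axis=1)
    img_det = img_det[:s_full.shape[0], :s_full.shape[1]]
    labels, k = ndimage.label(img_det)
    keep = [l for l in range(1, k + 1) if s_full[labels == l].any()]
    return np.isin(labels, keep)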
Updating of the Background Structure. If at a certain structure element the difference between the background structure and the current image structure is persistently above the threshold, the system estimates the value that should be assigned to that element to absorb the change. The estimation process consists in observing the temporal behaviour of the above-threshold structure element until it exhibits a stable value. Only when this occurs may the estimated value be used to update the background structure. In fact, the actual updating must be enabled explicitly by feedback information coming from a higher-level module. This consists of a binary image FB, as large as Mask, in which the higher-level module should redraw the blobs associated with "interesting" objects. Then, the structure elements intersecting at least one blob drawn in FB will not be updated, even though a stable value might be available. Conversely, if a structure element exhibiting a persistent change does not intersect any blob drawn in FB and an estimated value is available, the estimated value is copied into the background structure. Updating of the Background Image. Considering the red channel, the background image is updated at each frame according to a rule governed by a constant value; the same rule is also used for the green and blue channels.
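The exact per-channel updating rule is not reproduced above. A common choice for this kind of image-level maintenance is a running average restricted to non-foreground pixels, sketched below purely as an assumption; the blending constant alpha is illustrative.

import numpy as np

def update_background_image(bg, frame, fg_mask, alpha=0.05):
    # Blend the current frame into the background image on pixels that are
    # not foreground; foreground pixels keep the previous background value.
    bg = bg.astype(float)
    blended = (1.0 - alpha) * bg + alpha * frame.astype(float)
    return np.where(fg_mask[..., None], bg, blended)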
4 Experimental Results
The experiments are aimed at demonstrating how our change-detection algorithm can usefully interact with a simple tracking-classification module. The scene is a parking lot, where we can observe pedestrians and vehicles approaching the parking lot or leaving the scene after a prolonged stop. The tracking system is similar to that described in [7]: it tracks the blobs provided by the change-detection module by establishing frame-by-frame correspondences on the basis of distance measurements, and handles blob merging-split events by means of a set of heuristic rules. Many other tracking systems (e.g. [8,9,10,4]) rely on a blob-based approach, adopting different strategies to handle blob merging-split events. The feedback action between the change-detection algorithm and the higher-level module is aimed at facilitating the tracking task by avoiding as many blob-merging events as possible and by not including into the background the objects that stay still for a short time frame. In a parking lot scenario, if we use a traditional change-detection algorithm with a slow time constant, every time a car stops we observe a motionless blob that could merge with other objects moving in the same area, such as the passengers getting off the parked car. Moreover, when a car moves away after a stop, a ghost is produced and its blob is a potential source of merges with
the blobs of other objects moving around. These two problems could be partially solved by adopting a fast time constant. Unfortunately, with this choice, if a person temporarily stops moving, he will no longer be detected and will soon produce a ghost when walking away. Moreover, in a parking lot the detection of a motionless person is typically desirable to discover suspicious activities. The above problems can be dealt with effectively by exploiting our algorithm's capability to receive feedback information from a higher-level module. Basically, the change-detection algorithm should be controlled so as to continuously detect persons and to rapidly cease the detection of still cars and ghosts by absorbing them quickly into the background. Recalling Section 3, this can be obtained by always redrawing in FB all the detected blobs except still cars and ghosts. To proceed in this way, a classifier capable of recognising still cars and ghosts must be built on top of the tracking module. The very simple classifier adopted in our experiments is based on the following rules:
1. A tracked object having a motionless blob larger than a fixed threshold is a still car.
2. After a blob splits, if one of the resulting blobs is motionless and larger than a fixed threshold while the other resulting blobs are moving, then the motionless blob is the ghost of a car that has moved away.
The first rule allows fast insertion of all still cars into the background. In these cases, when the passengers subsequently get off the parked car, a ghost may appear inside the vehicle. This kind of ghost can be detected by means of the following additional rule:
3. If a blob appears within the region associated with a previously absorbed still car and subsequently splits into still and moving blobs, then the moving blobs are classified as passengers and the still ones as ghosts.
Figure 1 shows several frames from a sequence with a parking vehicle. Each snapshot contains on the left the tracking results (a labelled bounding box superimposed on the original image) and on the right the change-detection output (a binary image). In snapshots (a) and (b) a car approaching a parking lot is detected and easily tracked until it stops. The time elapsed between the first two snapshots is 4.88 seconds (as indicated in the figure). Now, our simple classifier applies rule 1 and recognises the tracked object as a still car, thus enabling its quick inclusion into the background. In fact, after 6 seconds (snapshot c), the car is still detected, but in the successive frame (snapshot d) its blob disappears instantaneously since the car has been included into the background model. As a result, in (e) only the pedestrian passing by the parked car is detected. Hence, the tracking of the pedestrian has been significantly facilitated by avoiding a merge with the parked car. In (f), after 10.64 seconds, the driver starts getting off the car. The associated blob is tracked and, as expected, a ghost is produced inside the car. When the driver splits away from the ghost, the classifier recognises the motionless blob as a ghost (snapshot g) by applying rule 3, and consequently, after only 1.4 seconds, the ghost is included into the background and no longer detected (snapshot h). Now the driver's blob is the only one to be detected and hence its tracking is straightforward.
Fig. 1. A sequence with a parking vehicle.
In (i) the driver stops walking and rests completely motionless in front of the car for more than 13 seconds. Though motionless for a long time, his blob is continuously detected, and hence straightforwardly tracked, since the classifier marks it as an "interesting" object (i.e. neither a still car nor a ghost), thus avoiding its inclusion into the background. When the driver subsequently walks away, no ghost is produced, as can be seen in (m), where another pedestrian passes through the region previously occupied by the driver. Figure 2 shows some frames from a sequence taken some minutes after the previous one. Initially, the background model contains the parked car, which will subsequently leave the parking lot. In snapshot (a) the driver enters the scene and in (b) gets into the car. So far only the driver has been detected, but when the car starts moving (snapshot c), the blob shown in the output is originated by a real object (i.e. the car) as well as by its ghost. In snapshot (d) a split event occurs and in (e) the classifier applies rule 2 to recognise the motionless blob as a ghost. Consequently, after a few seconds (snapshots f and g), the still car no longer belongs to the background and hence the ghost is instantaneously eliminated from the output, as can be seen in (h), where a pedestrian is correctly detected and straightforwardly tracked while walking through the region previously occupied by the ghost.
5 Conclusion
The proposed change-detection algorithm relies on a background model that has been designed to receive feedback information from higher-level processing
Fig. 2. A sequence with a leaving vehicle.
modules, thus allowing for flexible and intelligent control of the insertion/removal of objects into the background model. This can be deployed to attain a change-detection output optimised with respect to the specific requirements of the addressed application. We have demonstrated this capability considering a parking lot and showing that the change-detection output can be optimised so as to significantly facilitate the blob-tracking task by minimising the merges between still cars, ghosts and persons, as well as by enabling the continuous detection of still persons. This has been achieved by deploying the feedback action from the tracking-classification level to the change-detection level so as to properly handle the inclusion into the background of the tracked objects classified respectively as persons, still cars and ghosts. Finally, we point out that the very simple system used in our experiments could be employed in a classical change-triggered digital video recording application. In such a case, unlike a conventional change detector, our system would trigger the recording only when a car is moving or a person is present, thus storing only the truly relevant frames.
References
1. Di Stefano, L., Mattoccia, S., Mola, M.: A change detection algorithm based on structure and color. In: Int. Conf. on Advanced Video and Signal Based Surveillance. (2003)
2. Wren, C., et al.: Pfinder: Real-time tracking of the human body. IEEE PAMI 19 (1997)
3. Haritaoglu, I., Harwood, D., Davis, L.: W4: who? when? where? what? A real time system for detecting and tracking people. In: Int. Conf. on Automatic Face and Gesture Recognition. (1998)
4. Stauffer, C., Grimson, W.: Adaptive background mixture models for real-time tracking. In: Int. Conf. on Computer Vision and Pattern Recognition. (1999) 246–252
5. Harville, M.: A framework for high-level feedback to adaptive, per-pixel, mixture-of-Gaussian background models. In: European Conf. on Computer Vision. (2002)
6. Javed, O., Shafique, K., Shah, M.: A hierarchical approach to robust background subtraction using color and gradient information. In: Workshop on Motion and Video Computing. (2002)
7. Di Stefano, L., Mola, M., Neri, G., Viarani, E.: A rule-based tracking system for video surveillance applications. In: Int. Conf. on Knowledge Based Engineering Systems (KES). (2002)
8. McKenna, S., Jabri, S., Duric, Z., Rosenfeld, A., Wechsler, H.: Tracking groups of people. Computer Vision and Image Understanding 80 (2000) 42–56
9. Rosales, R., Sclaroff, S.: Improved tracking of multiple humans with trajectory prediction and occlusion modelling. In: Int. Conf. on Computer Vision and Pattern Recognition. (1998)
10. Senior, A., Hampapur, A., Ying-Li, T., Brown, L., Pankanti, S., Bolle, R.: Appearance models for occlusion handling. In: Int. Workshop on Performance Evaluation of Tracking Systems. (2001)
Dimension Reduction and Pre-emphasis for Compression of Hyperspectral Images1
C. Lee, E. Choi, J. Choe, and T. Jeong
Dept. of Electrical and Electronic Eng., Yonsei University, BERC, 134 Shinchon-Dong, Seodaumoon-Ku, 120-749 Seoul, Korea
[email protected]
Abstract. As the dimensionality of remotely sensed data increases, the need for efficient compression algorithms for hyperspectral images also increases. However, when hyperspectral images are compressed with conventional image compression algorithms, which have been developed to minimize mean squared errors, discriminant information necessary to distinguish among classes may be lost during the compression process. In this paper, we propose to enhance such discriminant information prior to compression. In particular, we first find a new basis in which class separability is better represented by applying a feature extraction method. However, due to high correlations between adjacent bands of hyperspectral data, singularity problems arise when applying feature extraction methods. In order to address this problem, we first reduce the dimension of the data and then find a new basis by applying a feature extraction algorithm. Finally, dominant discriminant features are enhanced and the enhanced data are compressed using a conventional compression algorithm such as 3-D SPIHT. Experiments show that the proposed compression method provides improved classification accuracies compared to existing compression algorithms.
1 Introduction
Remote sensing has been used in numerous applications, which include geology, meteorology, and environment monitoring. As sensor technology advances, the dimensionality of remotely sensed data increases sharply. It is expected that future sensors will generate a very large amount of data from remote sensing systems on a regular basis. Consequently, there is an increasing need for efficient compression algorithms for hyperspectral images. In particular, compression is required to transmit and archive hyperspectral data in many cases. A number of researchers have studied the compression of hyperspectral data [1-6]. For instance, Roger and Cavenor applied a standard DPCM-based lossless compression scheme to AVIRIS data using a number of linear predictors, where pixel residuals are encoded using variable-length coding (3-D DPCM) [2]. The Karhunen-Loeve transform can also be used to remove the spectral redundancy, and it is often followed by two-dimensional transforms
This work was supported by the Korea Science and Engineering Foundation (KOSEF) through the Biometrics Engineering Research Center (BERC) at Yonsei University.
such as the discrete cosine transform (DCT) and the discrete wavelet transform (DWT). Furthermore, efforts have been made to apply standard compression algorithms such as JPEG and JPEG 2000 to the compression of multispectral imagery. Recently, several authors have applied the SPIHT algorithm to the compression of multispectral imagery [4, 5]. Most conventional image compression algorithms have been developed to minimize mean squared errors. However, the discriminant features of the original data, which are required to distinguish among the various classes in classification problems, are not necessarily large in energy. Consequently, when hyperspectral images are compressed with conventional image compression algorithms, discriminant features may be lost during the compression process. In order to preserve such discriminant information, the discriminating information of remotely sensed data should be taken into account when designing the compression. Recently, efforts have been made to enhance such discriminant features prior to compression [6]. In particular, feature vectors which are dominant in discriminant power are found by applying a feature extraction algorithm, and such features are enhanced. However, due to the high correlations between adjacent bands of hyperspectral data, there is a singularity problem, since most feature extraction methods require the inverse of covariance matrices. In order to avoid this singularity problem, in [6] the spectral bands are divided into a number of groups and feature extraction is performed in each group. A problem with this approach is that discriminant features which utilize the entire set of spectral bands may not be enhanced. In this paper, in order to address the singularity problem, we first reduce the dimension of the hyperspectral data and then find a new basis by applying a feature extraction algorithm. Since the new basis is a basis of the reduced-dimensional space, we expand it to obtain a basis of the original-dimensional space. An advantage of the proposed method is that feature extraction can be performed using the entire set of spectral bands. Depending on the number of available training samples, one can determine the dimensionality reduction ratio. After feature extraction, we have a new basis in which discriminant information is better represented. Then, we emphasize the features which are dominant in discriminating power and apply a conventional compression algorithm such as 3-D SPIHT to the images whose discriminant features have been enhanced.
2 Feature Extraction and Pre-emphasis
Most feature extraction methods for classification problems produce new feature vectors in which class separability is better represented. In canonical analysis [7], a within-class scatter matrix and a between-class scatter matrix are used to formulate a criterion function, and a vector d is selected to maximize the ratio of the between-class scatter to the within-class scatter along d, where the within-class scatter matrix is the prior-weighted sum of the class covariance matrices.
Here the relevant quantities are the mean vector, the covariance matrix, and the prior probability of each class, respectively. In canonical analysis, the effectiveness of the new feature vectors is quantified by (1). In other words, the effectiveness of a feature vector d can be computed by a criterion function which produces a number. However, the criterion function is not directly related to the classification accuracy. In the decision boundary feature extraction method, feature vectors are instead extracted directly from the decision boundaries that a classifier defines [8]. In particular, the decision boundary feature matrix is defined as an integral over the decision boundary involving the outer product of the unit normal N(X) and the density p(X).
N(X) is the unit normal vector to the decision boundary at point X for the given pattern classification problem, p(X) is the probability density function, and S is the decision boundary over which the integral is performed. It was shown that the eigenvectors of the decision boundary feature matrix of a pattern recognition problem corresponding to non-zero eigenvalues are the feature vectors necessary to achieve the same classification accuracy as in the original space. It was also shown that the eigenvectors of the decision boundary feature matrix corresponding to zero eigenvalues do not contribute to the classification accuracy. Therefore, the eigenvectors of the decision boundary feature matrix are used as the new feature set.
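As a concrete counterpart to the canonical-analysis criterion mentioned above, the Python sketch below builds within- and between-class scatter matrices and extracts the generalized eigenvectors that maximize the class-separability ratio. It uses standard definitions rather than the exact expressions of [7] or the decision boundary feature matrix of [8], and it fails in exactly the singular situations discussed in the next section, when the within-class scatter cannot be inverted.

import numpy as np

def canonical_features(X, y, priors=None):
    # Within-class scatter Sw, between-class scatter Sb, and the directions
    # d maximizing d' Sb d / d' Sw d (generalized eigenvectors), ordered by
    # decreasing discriminating power.
    classes = np.unique(y)
    if priors is None:
        priors = {c: np.mean(y == c) for c in classes}
    mean_all = X.mean(axis=0)
    P = X.shape[1]
    Sw = np.zeros((P, P))
    Sb = np.zeros((P, P))
    for c in classes:
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sw += priors[c] * np.cov(Xc, rowvar=False)
        d = (mc - mean_all)[:, None]
        Sb += priors[c] * (d @ d.T)
    vals, vecs = np.linalg.eig(np.linalg.solve(Sw, Sb))  # needs Sw invertible
    order = np.argsort(-vals.real)
    return vals.real[order], vecs.real[:, order]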
In the proposed compression algorithm with enhanced discriminant features, the coefficients of feature vectors which are dominant in discriminant power are enhanced as follows:
where each weight is chosen in accordance with the discriminating power of the corresponding feature vector. Then, the pre-enhanced data (X') are compressed using a conventional compression algorithm such as 3-D SPIHT. In order to reconstruct the original data from the compressed data, the inverse weighting is applied to the expansion coefficients of the decompressed data.
It is assumed that the basis vectors and the weights are available at both the encoder and the decoder; in fact, they are part of the compressed data. In this paper, we use 3-D SPIHT as the compression algorithm together with the decision boundary feature extraction method, and we tested the following weight functions. Weight Function 1: weights given by the eigenvalues of the decision boundary feature matrix [8]. Weight Function 2: a stair function (width = 5 bands).
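The pre-emphasis and the corresponding inverse operation at the decoder amount to a simple change of basis with per-coefficient weighting, as sketched below in Python; the basis is assumed orthonormal, and the weight values shown (a stair-like function emphasizing the first 20 coefficients by a factor of 4) are illustrative rather than those used in the experiments.

import numpy as np

def pre_emphasize(X, basis, weights):
    # X: (N,) pixel spectrum; basis: (N, N) orthonormal matrix whose columns
    # are the feature vectors ordered by discriminating power; weights: (N,).
    coeffs = basis.T @ X            # expansion coefficients in the new basis
    return basis @ (weights * coeffs)

def de_emphasize(X_rec, basis, weights):
    # Inverse operation applied after decompression (basis and weights are
    # assumed to be transmitted as side information).
    coeffs = basis.T @ X_rec
    return basis @ (coeffs / weights)

# Illustrative stair-like weight function for N = 220 bands.
N = 220
weights = np.ones(N)
weights[:20] = 4.0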
3 Problems of High Dimensionality and Dimension Reduction
The data used in this paper were acquired by the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS), which contains 220 spectral bands [9]. Most feature extraction methods, including canonical analysis and the decision boundary feature extraction method, require the computation of covariance matrices, which should be invertible. However, due to the high correlations between adjacent bands, in most cases the covariance matrix of the 220 spectral bands may not be invertible, even when a very large number of training samples is available. A possible solution to this problem is to group the spectral bands. For instance, in [6], the spectral bands are divided into a number of groups and feature extraction is performed in each group. A problem with this approach is that discriminant features which require the entire set of spectral bands may not be enhanced. In this paper, we propose a different solution to deal with the singularity problem of high-dimensional data. We first reduce the dimensionality of the hyperspectral data by combining adjacent bands. For an easy illustration, it is assumed that we reduce the dimension by half by combining every two adjacent bands. This combining of adjacent bands can be expressed as follows:
where A is a 110 × 220 matrix, each row of which combines two adjacent bands.
Then, we compute the covariance matrices of these reduced-dimension data. The dimension of Y is 110 × 1, and the dimension of the resulting feature vectors is likewise 110 × 1. With the reduced dimension, we can estimate the covariance matrices more accurately, and the resulting feature extraction is reliable. However, the dimension of the feature vectors found in this way is different from that of the original data. In order to find the corresponding feature vectors in the original space, we first expand each feature vector to the original dimension by repeating every element, as follows:
where a normalising factor is multiplied to ensure a unit norm. It can be easily shown that the expanded feature vectors are orthogonal:
Using the Gram-Schmidt procedure, we can construct an orthonormal basis that includes the expanded feature vectors. With such an orthonormal basis, one can use equations (2)-(5) to enhance the discriminant information and compress the images.
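A sketch of the band combination and feature-vector re-expansion described above, under the assumption that the combination simply averages adjacent bands and that the basis completion is done with a QR factorisation standing in for the Gram-Schmidt procedure; the function and variable names are illustrative.

```python
import numpy as np

def combine_adjacent_bands(X, factor=2):
    """Reduce dimension by averaging groups of `factor` adjacent bands.

    X : (..., B) spectra with B bands; returns (..., B // factor).
    """
    B = X.shape[-1]
    return X.reshape(*X.shape[:-1], B // factor, factor).mean(axis=-1)

def expand_feature_vector(phi_reduced, factor=2):
    """Map a feature vector found in the reduced space back to the original
    dimension by repeating every element, then renormalising to unit norm."""
    phi = np.repeat(phi_reduced, factor)
    return phi / np.linalg.norm(phi)

def complete_orthonormal_basis(vectors):
    """Complete a set of orthonormal N-vectors (rows) to a full basis of R^N.
    Gram-Schmidt, done here via QR on [vectors; random fill]."""
    k, N = vectors.shape
    rng = np.random.default_rng(0)
    M = np.vstack([vectors, rng.standard_normal((N - k, N))])
    Q, _ = np.linalg.qr(M.T)   # columns of Q are orthonormal
    return Q.T                 # rows: the first k span the given vectors

# 220-band spectrum reduced to 110 bands, feature vector expanded back
x = np.random.default_rng(1).standard_normal(220)
y = combine_adjacent_bands(x)               # shape (110,)
phi_red = np.eye(110)[0]                    # hypothetical reduced-space feature vector
phi_full = expand_feature_vector(phi_red)   # shape (220,), unit norm
basis = complete_orthonormal_basis(phi_full[None, :])   # full orthonormal basis of R^220
```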
4 Experiments and Results
The data used in the experiments were acquired by the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS). The data set contains 220 spectral bands. From the data, we selected several classes which have a sufficient number of training samples; the selected classes are shown in Fig. 1. We used the band combination method to reduce the dimension of the data [10]. In order to evaluate the performance of the proposed algorithm, we computed SNRs and classification accuracies and compared them with those of the non-enhanced data. In the AVIRIS data, each channel is assumed to have 12-bit resolution. In this paper, the bit rate is defined on a per-band basis; in other words, 1 bpp indicates 1 bit per pixel for each band. In order to evaluate the performance of the compression algorithm, we computed the SNR, which is defined as
The value of was approximately . From the selected area, we chose 15 classes, and Table 1 shows the class information. We used the decision boundary feature extraction to enhance the features which are dominant in discriminating power. The Gaussian ML classifier was used, assuming a Gaussian distribution for each class, and the 220 bands were reduced to 20 features by combining adjacent bands [10]. It is noted that the combination of adjacent bands was performed after decompressing the compressed data; in other words, the compression was performed on the original data.
Fig. 1. Selected sub-regions (Bands = 17, 27, 50)
Fig. 2 shows a performance comparison (SNR) for different band divisions (2 groups, 4 groups, 11 groups) and the two weight functions (eigenvalues and stair functions). It can be seen that the stair function (Weight Function 2) showed noticeable improvements compared to Weight Function 1, while providing performances comparable to those of the compression without any pre-emphasis. It appears that the stair function provides better performance than Weight Function 1, which uses eigenvalues. Figs. 3-4 show classification accuracies for the reconstructed images (bpp = 0.1, 0.4). As can be seen in the figures, the proposed compression methods with pre-enhancement provide noticeably better classification performance than the compression without any pre-enhancement. It appears that class separability is preserved even at bpp = 0.1. However, at high bit rates (bpp = 1.2), the differences between the compression methods with pre-enhancement and the compression without pre-emphasis become small (figures not shown). Fig. 4 shows classification accuracies of the reconstructed data (0.4 bpp) for different reduction ratios (reduced by 2, 4 and 11). It appears that the improvement in classification accuracy increases slightly, particularly on the training data, when the reduction ratio becomes large, though the improvement is not consistent. The optimal reduction ratio should be determined by considering the available training samples.
5 Conclusions
In this paper, we propose a compression method for hyperspectral images with pre-emphasis. In particular, we first reduce the dimension of the hyperspectral data in order to address the singularity problem which arises from the high correlations between adjacent bands. Then we apply a feature extraction method to find a new basis in which class separability is better represented. Finally, the dominant discriminant features are enhanced and a compression algorithm such as 3-D SPIHT is applied. Experiments show that the proposed method provides better classification accuracies than existing compression algorithms.
Fig. 2. Performance comparison (SNR) with different reduction ratios (reduced by 2, 4, and 11)
Fig. 3. Comparison of classification accuracies for reconstructed data (0.1 bpp). (a) training accuracies, (b) test accuracies
Fig. 4. Comparison of classification accuracies for reconstructed data (0.4 bpp) for several reduction ratios (reduced by 2, 4, and 11). (a) training accuracies, (b) test accuracies
Acknowledgment. The authors would like to thank Prof. David A. Landgrebe, Purdue University, for providing the valuable AVIRIS data.
References
1. J. A. Saghri, A. G. Tescher, and J. T. Reagan, "Practical transform coding of multispectral imagery," IEEE Signal Processing Magazine, vol. 12, no. 1, pp. 32-43, 1995.
2. R. E. Roger and M. C. Cavenor, "Lossless compression of AVIRIS images," IEEE Trans. Image Processing, vol. 5, no. 5, pp. 713-719, 1996.
3. B. Aiazzi, P. S. Alba, L. Alparone and S. Baronti, "Reversible compression of multispectral imagery based on an enhanced inter-band JPEG prediction," Proc. IEEE IGARSS'97, vol. 4, pp. 1990-1992, 1997.
4. A. Said and W. A. Pearlman, "A new fast and efficient image codec based on set partitioning in hierarchical trees," IEEE Trans. Circuits Syst. Video Technol., vol. 6, pp. 243-250, June 1996.
5. P. L. Dragotti, G. Poggi, and A. R. P. Ragozini, "Compression of multispectral images by three-dimensional SPIHT algorithm," IEEE Trans. Geoscience and Remote Sensing, vol. 38, pp. 416-428, 2000.
6. C. Lee and E. Choi, "Compression of hyperspectral images with enhanced discriminant features," IEEE Workshop on Advances in Techniques for Analysis of Remotely Sensed Data, 2003.
7. K. Fukunaga, Introduction to Statistical Pattern Recognition, 2nd Edition, Academic Press, New York, 1990.
8. C. Lee and D. A. Landgrebe, "Feature extraction based on the decision boundaries," IEEE Trans. Pattern Anal. Machine Intell., vol. 15, pp. 388-400, 1993.
9. G. Vane, R. O. Green, T. G. Chrien, H. T. Enmark, E. G. Hansen and W. M. Porter, "The airborne visible/infrared imaging spectrometer (AVIRIS)," Remote Sensing of Environment, vol. 44, pp. 127-143, 1993.
10. C. Lee and D. A. Landgrebe, "Analyzing high dimensional multispectral data," IEEE Trans. Geoscience and Remote Sensing, vol. 31, no. 4, pp. 792-800, 1993.
Viewpoint Independent Detection of Vehicle Trajectories and Lane Geometry from Uncalibrated Traffic Surveillance Cameras José Melo1, 2, Andrew Naftel1, Alexandre Bernardino2, and José Santos-Victor2 1
University of Manchester Institute of Science and Technology, PO Box 88, Sackville Street, Manchester M60 1QD, UK
[email protected],
[email protected]
2
Instituto Superior Técnico, Av. Rovisco Pais 1049 - 001 Lisboa, Portugal {jpqm,alex,jasv}@isr.ist.utl.pt
Abstract. In this paper, we present a low-level object tracking system that produces accurate vehicle trajectories and estimates the lane geometry using uncalibrated traffic surveillance cameras. A novel algorithm known as Predictive Trajectory Merge-and-Split (PTMS) has been developed to detect partial or complete occlusions during object motion and hence update the number of objects in each tracked blob. This hybrid algorithm is based on the Kalman filter and a set of simple heuristics for temporal analysis. Some preliminary results are presented on the estimation of lane geometry through aggregation and K-means clustering of many individual vehicle trajectories modelled by polynomials of varying degree. We show how this process can be made insensitive to the presence of vehicle lane changes inherent in the data. An advantage of this approach is that estimation of lane geometry can be performed with non-stationary uncalibrated cameras.
1
Introduction
Intelligent traffic surveillance systems are assuming an increasingly important role in highway monitoring and city road management systems. Their purpose, amongst other things, is to provide statistical data on traffic activity, such as vehicle density, and to signal potentially abnormal situations. This paper addresses the problem of vehicle segmentation and tracking, screening of partial and complete occlusions, and generation of accurate vehicle trajectories when using non-stationary uncalibrated cameras such as operator-controlled pan-tilt-zoom (PTZ) cameras. We demonstrate that by building a self-consistent aggregation of many individual trajectories and by taking into account vehicle lane changes, lane geometry can be estimated from uncalibrated but stable video sequences. In our work, rather than performing object tracking under partial or total occlusion, we describe an occlusion reasoning approach that detects and counts the number of overlapped objects present in a segmented blob. Trajectory points are then classified according to whether they are generated by a single or overlapped object. This paper
describes the Predictive Trajectory Merge-and-Split (PTMS) algorithm for performing the aforementioned task. It uses a Kalman filter (KF) and a set of simple heuristic rules to enforce temporal consistency on merging and splitting overlapping objects within detected blobs. The method is independent of the camera viewpoint and requires no a priori calibration of the image sequences.
2
Review of Previous Work
The starting point for much work in analysing surveillance images is the segmentation of moving objects based on background subtraction methods [1-2]. Typically, each pixel is modelled using a Gaussian distribution built up over a sequence of individual frames and segmentation is then performed using an image differencing strategy. Shadow detection and elimination strategies have been commonly employed to remove extraneous segmented features [4-7]. It is also important to handle partial and complete occlusions in the video data stream [7-10]. Occlusion detection can be performed using an extended Kalman filter that predicts position and size of object bounding regions. Any discrepancy between the predicted and measured areas can be used to classify the type and extent of an occlusion [9], [10]. Higher level traffic analysis systems have also been developed specifically for accident detection at road intersections [9], [11] and estimating traffic speed [12], [13]. More general techniques for object path detection, classification and indexing have also been proposed [10], [14-17]. Our work is most closely related to [10], [12], [13]. In [12] an algorithm to estimate mean traffic speed using uncalibrated cameras is presented. It employs geometric constraints in the image, inter-frame vehicle motion and distribution of vehicle lengths. Traffic flow histograms and the image vanishing point are used in [13] to measure mean speed but it has similar limitations to the previous approach. The work in this paper shows that accurate vehicle trajectories can be built from uncalibrated image sequences and can be aggregated to model lane geometry and ultimately determine traffic speed and classify normal and anomalous situations.
3
Predictive Trajectory Merge-and-Split (PTMS) Algorithm
The proposed system uses a multi-stage approach to determine the vehicle motion trajectories and, eventually, the lane geometry. First, we build a background model to segment foreground objects. A detected foreground blob comprises a connected region having more than a certain pre-defined minimum number of pixels in its area. A constant-acceleration Kalman filter (KF) is used to track the blobs through image coordinate space. The PTMS algorithm is then used to perform a time-consistent analysis of the detected blobs, allowing for merging and splitting due to partial and complete occlusions. An overview of the system is shown in Fig. 1.
Fig. 1. Block diagram of the proposed system
3.1
Background Initialization
We use a Gaussian distribution in the Adaptive Smoothness Method [1] to build a background model. Detected blobs having an area smaller than a pre-defined threshold are deemed to be noise and disregarded. Erode and dilate operations are used to eliminate small holes within blobs. Shadow removal is not incorporated, but during the background update stage a double thresholding operation is performed to eliminate self-shadowing.
3.2
Steady State Kalman Filter
If we wish to build complete motion histories for each tracked object, i.e. to determine the position of an object at each time step, it is necessary to implement a KF [19] to resolve tracking instabilities caused by near and partial occlusions, shadows and image noise. In the case of multiple simultaneous object tracking, if we lose track of one vehicle and another vehicle is suddenly detected nearby, there is an obvious danger of mistaken vehicle identification. Even assuming that vehicles drive at constant velocity, their velocity in the image plane is time-varying due to camera perspective effects. Therefore, we approximate the vehicle position in the image with a constant-acceleration Kalman filter. In the equations that follow, we work in image coordinates and assume that the tuning parameters are the same for objects moving towards and away from the camera. At this stage we are not modelling the noise in vehicle position, thus we use a constant-coefficient KF whose coefficients are manually tuned for good performance.
We use a steady-state version of the KF, often referred to as the alpha-beta-gamma filter [19]. Let the measurement vector X = (x, y) represent the centroid of the detected blob, and the state vector S = (x, y, x', y', x'', y''), where prime and double prime denote first and second derivatives with respect to time, i.e. velocity and acceleration in the x and y directions. In the initial state the velocity and acceleration are set to zero. Let the estimated position, velocity and acceleration at time step k be given, along with their predicted values. If X(k) is the blob centroid position and T the sampling period, then the filter equations are the following: Update equations:
Prediction equations:
Fixed values are chosen for the filter parameters. When the PTMS detects an occlusion, the KF is not updated with the new value of X.
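The following sketch illustrates one common formulation of such a steady-state constant-acceleration (alpha-beta-gamma) filter; the gain values and the exact placement of the acceleration gain vary between conventions and are not taken from the paper.

```python
class AlphaBetaGammaFilter:
    """Steady-state constant-acceleration Kalman filter (alpha-beta-gamma form).

    Tracks one image coordinate; use two instances for the (x, y) centroid.
    The gains alpha, beta, gamma are fixed, hand-tuned constants.
    """
    def __init__(self, x0, T=1.0, alpha=0.5, beta=0.4, gamma=0.1):
        self.x, self.v, self.a = float(x0), 0.0, 0.0   # position, velocity, acceleration
        self.xp, self.vp, self.ap = self.x, self.v, self.a
        self.T = T
        self.alpha, self.beta, self.gamma = alpha, beta, gamma

    def predict(self):
        T = self.T
        self.xp = self.x + T * self.v + 0.5 * T * T * self.a
        self.vp = self.v + T * self.a
        self.ap = self.a
        return self.xp

    def update(self, z):
        """Correct the prediction with measurement z (skipped during occlusion)."""
        r = z - self.xp                                  # innovation
        self.x = self.xp + self.alpha * r
        self.v = self.vp + (self.beta / self.T) * r
        self.a = self.ap + (2.0 * self.gamma / (self.T * self.T)) * r
        return self.x

# Track a blob centroid x-coordinate over a few frames
kf = AlphaBetaGammaFilter(x0=100.0)
for z in [103.0, 107.5, 113.0, 119.0]:
    pred = kf.predict()   # prediction used by the PTMS rules
    est = kf.update(z)    # correction when the blob is actually observed
```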
3.3
Heuristic Merge-and-Split Rules
The presence of shadows or 'near' occlusions caused by traffic congestion can seriously degrade the accuracy of blob detection. Typically, several vehicles may be misdetected as one single vehicle, with consequent problems for generating an object trajectory. Approaches based on spatial reasoning use more complex object representations such as templates or trained shape models. However, this is dependent on image resolution and only works under partial occlusion. A better approach is to use a temporal smoothness constraint when checking vehicle positions under different types of occlusion. Here, we propose a set of temporal rules that can easily complement a spatial approach. The algorithm works as follows. First, we define a blob as a connected region resulting from the background subtraction process. Then the KF is used to predict, for each blob, the most likely position at which the blob will appear in the next frame. Each blob is considered to have a number of children, i.e. the number of different objects the blob is composed of. At the beginning, every blob is initialized as having one child. For each frame and for every blob:
1. Determine whether there is a 1-1 correspondence by checking the sizes and positions of blobs in consecutive frames.
2. For every blob that does not satisfy the previous condition, determine whether its size has decreased by more than a given percentage. If so, decrease its number of occluded objects by 1.
3. If any blob has decreased its size by less than this percentage, store that information.
4. Determine whether any new blob has appeared in the vicinity of a blob whose size decreased and whose number of children was greater than 1. If so, decrease the number of occluded objects in the old blob; the old blob was occluding the new blob.
5. Check if there are any new blobs in the new frame.
6. If a new blob appears in the same position as several old blobs, it means that the new blob is composed of the old blobs, and its number of children is increased by the number of old blobs minus 1.
The algorithm works fairly well most of the time; the principal drawback arises when the initial blob is already composed of several objects, in which case it is misdetected as one single object. To tackle this problem, a spatial algorithm could be applied to the initial blobs to determine whether they are composed of one or more objects. The results of applying the PTMS algorithm are presented in Section 5.
4
Estimating Lane Geometry from Object Trajectories
In highly constrained environments such as highways, it is tempting to use vehicle motion trajectories rather than image analysis of static scenes to determine the lane geometry. The former approach has a number of advantages: it allows the use of controlled pan-tilt-zoom cameras rather than static cameras; object trajectories are independent of scale and viewpoint considerations; and motion is more robust than spatial data with respect to light variation and noise. The method assumes that the average lane width in image coordinates is known in advance. However, it does not require a priori knowledge of the number of lanes or the road geometry, i.e. whether it is a straight or curved section of highway. First, we apply a pre-filtering stage to remove obviously invalid trajectories produced by poor background initialization. Excluded trajectories are those with consecutive inter-point differences greater than some threshold, or with a total length less than some pre-defined threshold. To calculate the approximate centre of each lane, we first fit a least squares polynomial of degree M to each trajectory; the average residual error of fit can be used to ascertain the optimal value of M. Next, we apply a robust K-means clustering algorithm that works in the coefficient space of the polynomials. To reduce the time complexity, we use a heuristic to limit the number of candidate trajectories to those with a greater likelihood of belonging to a lane. Finally, the RANSAC [18] algorithm is used on the clustered trajectories to determine a least squares polynomial fit to the
lane centres. RANSAC is robust to outlier trajectories produced by frequent vehicle lane changes, undetected overlapped vehicles and noise in the video sequence. Further details of this method are presented in a companion paper.
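A minimal sketch of the polynomial fitting and coefficient-space clustering step, assuming trajectories are roughly functions of the image x-coordinate and that the number of lanes is supplied; the robust RANSAC refit of each cluster described above is omitted here.

```python
import numpy as np
from sklearn.cluster import KMeans

def fit_trajectory(traj, degree=2):
    """Least-squares polynomial fit y = p(x) to one trajectory.

    traj : (n, 2) array of image points (x, y); returns the coefficient vector.
    """
    x, y = traj[:, 0], traj[:, 1]
    return np.polyfit(x, y, degree)

def cluster_lanes(trajectories, n_lanes, degree=2):
    """Group trajectories into lanes by K-means in polynomial-coefficient space
    and return one representative polynomial per lane (coefficient centroid)."""
    coeffs = np.array([fit_trajectory(t, degree) for t in trajectories])
    km = KMeans(n_clusters=n_lanes, n_init=10, random_state=0).fit(coeffs)
    return km.cluster_centers_, km.labels_

# Synthetic example: two straightish lanes with noisy vehicle trajectories
rng = np.random.default_rng(0)
trajs = []
for lane_offset in (50.0, 90.0):
    for _ in range(20):
        x = np.linspace(0, 300, 40)
        y = lane_offset + 0.05 * x + rng.normal(0, 1.5, x.size)
        trajs.append(np.column_stack([x, y]))
centres, labels = cluster_lanes(trajs, n_lanes=2)
```

In practice a RANSAC polynomial fit per cluster, rather than the plain coefficient centroid, would provide the robustness to lane changes and outlier trajectories discussed above.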
5
Results
The results of applying the PTMS algorithm are now presented. The video sequences were recorded in grey scale at a rate of 15 frames/sec with a 176x144 pixel resolution. In Fig. 2 we show the result of background subtraction. Segmented objects are denoted in red, whereas detected vehicles are coloured purple. A different colour is used for the bounding box of a tracked vehicle. When tracking of a vehicle is lost, we place a cross to highlight the position predicted by the KF. In this sequence, the influence of the KF prediction was not very significant.
Fig. 2. Tracked vehicles
Fig. 3. Tracking and occlusion handling
Fig. 3 shows the result of occlusion handling applied to the previous figure. Observe that the two cars on the left of the image are detected as a single blob and, through the use of the PTMS algorithm, we can determine that it corresponds to two cars in the previous frame. The detected blob is displayed with its bounding box in red with a cross drawn in the middle. In Fig. 4 we display the trajectories generated by the KF and the PTMS algorithm applied to the same sequence from which Figs. 2 and 3 were drawn. Trajectories in green correspond to single vehicles successfully tracked, whereas those in purple correspond to vehicles previously detected but whose tracking was subsequently lost; these points are predicted by the output of the KF. The red points correspond to trajectories of the averaged position of two or more overlapped vehicles detected through the use of PTMS.
Fig. 4. Vehicle trajectories generated through hybrid tracking and PTMS algorithm
Since the approach adopted is low-level and independent of camera viewpoint and type of object motion, we tested the hybrid tracking and PTMS approach with a
different data set recorded at a road intersection. A typical frame taken from the sequence is shown in Fig. 7a. In Fig. 5 we can observe that there are no object occlusions, and all the vehicles are detected as single objects. In Fig. 6 the PTMS algorithm detects a blob comprised of two vehicles and a second blob with four occluding vehicles. An unidentified moving object is mis-detected as comprising two occluding vehicles.
Fig. 5. Tracked vehicles
Fig. 6. Tracking and occlusion handling
In Fig. 7b we display the set of trajectories calculated from the sequence of Fig. 7a (see footnote 1), with the colours having the same semantics as in Fig. 4.
Fig. 7 (a) Typical scene at a road intersection. (b) Trajectories
We now show some preliminary results of applying the clustering approach described in Section 4 to the computed trajectories. The computed point trajectories of single vehicles (Fig. 8a) are used to estimate the lane centres (Fig. 8b) on a curved segment of highway. From a total of 175 partial trajectories in the image sequence, the K-means clustering algorithm uses 20 trajectories per lane to estimate the centres. It should be noted that although the original trajectory data contain vehicle lane changes, the RANSAC fitting method can be made insensitive to these by careful parameter tuning. The clustering is carried out as a post-processing operation.
Fig. 8. (a) Original trajectories of single tracked vehicles containing outliers. (b) Estimated lane centres. (c) Processing time for applying the clustering algorithm
1 Image sequence downloaded from http://i21www.ira.uka.de/image_sequences
Viewpoint Independent Detection of Vehicle Trajectories and Lane Geometry
461
Fig. 9 illustrates similar results for a straight highway segment using uncalibrated PTZ cameras. Here we start from an initial total of 200 partial trajectories and again use 20 trajectories per lane to estimate the centres.
Fig. 9. (a) Original trajectories of single tracked vehicles containing outliers. (b) Estimated lane centres. (c) Processing time for applying clustering algorithm
The processing times for each frame in the respective sequences are shown in Fig. 8c and Fig. 9c. In each case, the algorithm starts with zero clusters and adds 2 new trajectories per frame. More results can be found on the authors' webpage (see footnote 2).
6
Discussion and Conclusions
This paper proposes an algorithm for vehicle tracking with the following characteristics: temporal integration with a Kalman filter, time-consistent merging and splitting of overlapped detected blobs, aggregation of trajectory data to estimate lane centres, and removal of the need for calibrated cameras. The preliminary results demonstrate the feasibility of using ordinary uncalibrated stationary or PTZ cameras to analyse traffic behaviour in real time. The algorithm is viewpoint independent and does not make any a priori assumption regarding lane geometry. The results can be used as input to higher-level traffic monitoring systems for estimating traffic speed, frequency of lane changes, accident detection and classification of anomalous driver behaviour. We use some limited assumptions regarding camera zoom and image scale. One drawback of the clustering approach is that, due to occlusions, vehicle trajectories are sometimes misdetected and hence partitioned into erroneous cluster sets. It is often difficult to distinguish these from genuine lane changes at the post-processing stage. In future work, we intend to tackle this limitation.
Acknowledgements. This work is partially funded by the Portuguese project ADIINTELTRAF. The authors would like to thank ISR and Observit for the video sequences.
2 http://omni.isr.ist.utl.pt/~jpqm/inteltraf.htm
References
1. D. Gutchess, M. Trajkovics, E. Cohen-Solal, D. Lyons, A. K. Jain: A Background Model Initialization Algorithm for Video Surveillance. Proc. IEEE ICCV 2001, Pt. 1 (2001) 744-740.
2. I. Haritaoglu, D. Harwood, L. S. Davis: W4: Real-Time Surveillance of People and Their Activities. IEEE Trans. Patt. Anal. Mach. Intell. 22 (2000) 809-830.
3. A. Prati, I. Mikic, C. Grana, M. Trivedi: Shadow Detection Algorithms for Traffic Flow Analysis: a Comparative Study. IEEE Trans. Intell. Transport. Syst. (2001) 340-345.
4. A. Elgammal, R. Duraiswami: Background and Foreground Modeling Using Nonparametric Kernel Density Estimation for Visual Surveillance. Proc. IEEE 90 (2002).
5. R. Cucchiara, C. Grana, M. Piccardi, A. Prati: Detecting Objects, Shadows and Ghosts in Video Streams by Exploiting Colour and Motion Information. Proc. 11th International Conference on Image Analysis and Processing, ICIAP (2001).
6. S. Kamijo, Y. Matsushita, K. Ikeuchi, M. Sakauchi: Traffic Monitoring and Accident Detection at Intersections. IEEE Trans. Intell. Transport. Syst. 1 (2000).
7. D. Koller, J. Weber, J. Malik: Robust Multiple Car Tracking with Occlusion Reasoning. Proc. Third European Conference on Computer Vision, LNCS 800, Springer-Verlag (1994).
8. H. Veeraraghavan, O. Masoud, N. Papanikolopoulos: Computer Vision Algorithms for Intersection Monitoring. IEEE Trans. Intell. Transport. Syst. 4 (2003) 78-89.
9. Y. Jung, K. Lee, Y. Ho: Content-Based Event Retrieval Using Semantic Scene Interpretation for Automated Traffic Surveillance. IEEE Trans. Intell. Transport. Syst. 2 (2001) 151-163.
10. S. Kamijo, Y. Matsushita, K. Ikeuchi, M. Sakauchi: Traffic Monitoring and Accident Detection at Intersections. IEEE Trans. Intell. Transport. Syst. 1 (2000).
11. D. J. Dailey, F. W. Cathey, S. Pumrin: An Algorithm to Estimate Mean Traffic Speed Using Uncalibrated Cameras. IEEE Trans. Intell. Transport. Syst. 1 (2000).
12. T. N. Schoepflin, D. J. Dailey: Dynamic Camera Calibration of Roadside Traffic Management Cameras for Vehicle Speed Estimation. IEEE Trans. Intell. Transport. Syst. 4 (2003) 90-98.
13. D. Makris, T. Ellis: Path Detection in Video Surveillance. Image and Vision Computing 20 (2002) 895-903.
14. C. Stauffer, W. Grimson: Learning Patterns of Activity Using Real-Time Tracking. IEEE Trans. Patt. Anal. Mach. Intell. 22 (2000) 747-757.
15. N. Johnson, D. Hogg: Learning the Distribution of Object Trajectories for Event Recognition. Image and Vision Computing 14 (1996) 609-615.
16. A. Elgammal, R. Duraiswami: Background and Foreground Modeling Using Nonparametric Kernel Density Estimation for Visual Surveillance. Proc. IEEE 90.
17. M. A. Fischler, R. C. Bolles: Random Sample Consensus: A Paradigm for Model Fitting with Applications to Image Analysis and Automated Cartography. Comm. of the ACM 24 (1981) 381-395.
18. E. Brookner: Tracking and Kalman Filtering Made Easy. John Wiley & Sons (1998).
Robust Tracking and Object Classification Towards Automated Video Surveillance Jose-Luis Landabaso1, Li-Qun Xu 2, and Montse Pardas1 1
Technical University of Catalunya, Barcelona, Spain 2 BT Exact, Adastral Park, Ipswich, UK
Abstract. This paper addresses some of the key issues in computer vision that contribute to the technical advances and system realisation of automated visual event analysis in video surveillance applications. The objectives are to robustly segment and track multiple objects in a cluttered dynamic scene and, if required, to further classify the objects into several categories, e.g. single person, group of people or car. Two major contributions are presented. First, an effective scheme is proposed for accurate cast shadow / highlight removal with error correction based on conditional morphological reconstruction. Second, a temporal template-based robust tracking scheme is introduced, taking account of multiple characteristic features (velocity, shape, colour) of a 2D object appearance simultaneously, in accordance with their respective variances. Extensive experiments on video sequences of a variety of real-world scenarios have been conducted, showing very promising tracking performance, and results on the PETS2001 sequences are illustrated.
1
Introduction
Accurate and robust segmentation and tracking of multiple moving objects in dynamic and cluttered visual scenes is one of the major challenges in computer vision. It is particularly desirable in the video surveillance field, where an automated system allows fast and efficient access to unforeseen events that need to be attended to by security guards or law enforcement officers, as well as enabling tagging and indexing of interesting scene activities / statistics in a video database for future retrieval on demand. In addition, such systems are the building blocks of higher-level intelligent vision-based or vision-assisted information analysis and management systems, with a view to understanding the complex actions, interactions, and abnormal behaviours of objects in the scene. Vision-based surveillance systems can be classified in several different ways, considering the environment in which they are designed to operate, i.e. indoor, outdoor or airborne; the type and number of sensors; and the objects and level of detail to be tracked. In this paper our focus is on processing videos captured by a single fixed outdoor CCTV camera overlooking areas where there are a variety of vehicle and/or people activities. There are typically a number of challenges associated with the chosen scenario in realistic surveillance application environments: natural cluttered
background, repetitive background, illumination changes, occlusions, object entries and exits, and shadows and highlights. Over recent years there have been extensive research activities proposing new ideas, solutions and systems for robust object tracking to address the above situations [1]. Most of them adopt 'background subtraction' as a common approach to detecting foreground moving pixels, whereby the background scene structures are modelled pixel-wise by various statistically-based learning techniques on features such as intensities, colours, edges, textures, etc. The models employed include parallel unimodal Gaussians [2], a mixture of Gaussians [3], non-parametric kernel density estimation [4], or simply temporal median filtering [5]. A connected component analysis (CCA) [6] then follows to cluster and label the foreground pixels into meaningful object blobs, from which inherent appearance and motion features can be extracted. Finally, a blob-based tracking process aims to find persistent blob correspondences between consecutive frames. In addition, most application systems also deal with the issues of object categorisation or identification (and possibly detailed parts analysis), either before [7] or after [5] tracking is established. Regarding the matching method and metric, the heterogeneous nature of the features extracted from the 2D blobs has motivated some researchers to use only a few features, e.g. the size and velocity in [8] for motion correspondence, and the size and position with Kalman predictors in [3]. Others, using more features, decide to conduct the matching in a hierarchical manner, for instance in the order of centroid, shape and colour as discussed in [5]. Note that if some domain knowledge is available, e.g. the object to be tracked is known to be a single person, then more complex dynamic appearance models of the silhouettes can be used [7]. Also, in [4] special probabilistic object appearance models have been used to detect and track individual persons who start to form a group and occlude each other [9]. In this paper we describe a robust multi-object tracking and classification system in which several novel ideas are introduced. These include the use of false foreground pixel suppression; cast shadow / highlight removal; and a matching process using a scaled Euclidean distance metric, in which a number of features characterising a foreground object are used simultaneously, taking into account the scaling and variance of each of the features. The method is not only very accurate, but also allows an easier inclusion of other extracted features, if necessary, leaving room for future enhancement. The system further incorporates a classification module to classify each persistently tracked object, based on the analysis of local repetitive motion changes within the blob representation over a period of time. Figure 1 depicts schematically the block diagram of our object tracking and classification system. The paper is structured as follows. In the next section the techniques for pixel-domain analysis leading to the segmented foreground object blobs are described. Section 3 is devoted to issues concerning robust object tracking, including the use of temporal templates, the matching procedure, and object entries and exits. Section 4 describes the object classification approach adopted.
Fig. 1. The system block diagram showing the chain of functional modules.
Section 5 illustrates the experimental evaluations of the system. And finally, the paper concludes in Section 6.
2
Moving Objects Segmentation
The first issue to address in the chain of the proposed surveillance system is the segmentation of those image pixels that do not belong to the background scene. As in [8], the adaptive background subtraction method proposed by Stauffer and Grimson [3] is adopted. A mixture of K Gaussian distributions is used to model the RGB colour changes at each pixel location in the imaged scene over time. With each incoming frame the Gaussian distributions are updated and then used to determine which pixels are most likely to result from a background process. This model allows a proper representation of a background scene undergoing slow lighting and scene changes as well as momentary variations such as trees and flags swaying in the wind. The foreground pixels thus obtained, however, are not exempt from false detections due to noise in the background and camera jitter. A false-foreground pixel suppression procedure is introduced to alleviate this problem: when a pixel is initially classified as a foreground pixel, its 8-connected neighbouring pixels' models are examined; if the majority of these models, when applied to this pixel, agree that it is a background pixel, then it is considered a false detection and removed from the foreground.
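The following sketch illustrates the idea of the false-foreground suppression step; for brevity it uses a single Gaussian per pixel in place of the mixture model, and the majority threshold and border handling are illustrative assumptions.

```python
import numpy as np

def suppress_false_foreground(frame, mean, var, fg_mask, k=2.5):
    """Remove isolated false foreground detections.

    A pixel initially labelled foreground is re-labelled background if the
    majority of its 8 neighbours' background models would accept its value.
    A single Gaussian per pixel stands in for the mixture model here.

    frame, mean, var : (H, W) grey-level image and per-pixel background model
    fg_mask          : (H, W) boolean initial foreground mask
    """
    votes = np.zeros(frame.shape, dtype=np.int32)
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            if dy == 0 and dx == 0:
                continue
            # neighbour's model shifted onto the current pixel (borders wrap;
            # a real implementation would treat image borders explicitly)
            m = np.roll(np.roll(mean, dy, axis=0), dx, axis=1)
            v = np.roll(np.roll(var, dy, axis=0), dx, axis=1)
            votes += (np.abs(frame - m) < k * np.sqrt(v)).astype(np.int32)
    # keep only foreground pixels that fewer than 5 of the 8 neighbour models accept
    return fg_mask & ~(votes >= 5)
```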
Fig. 2. (a) A snapshot of a surveillance video sequence, the cast shadows from pedestrians are strong and large; (b) the result of initial foreground pixels segmentation, the moving shadows being included; (c) The “skeleton” image obtained after the shadow removing processing; and (d) the final reconstructed objects with error corrections.
Once the foreground object pixels have been identified, a further scheme is applied to determine whether some of these foreground pixels correspond to areas likely to be cast shadows or specular reflections. The working mechanism of this novel scheme is the following. As a first step, a simplified version of the technique discussed in [10] is used to evaluate the variability in both brightness and colour distortion between the foreground pixels and the adaptive background, and possible shadows and highlights are detected. It was observed, though, that this procedure is less effective in cases where the objects of interest have colours similar to those of the presumed shadows. To correct this, an assertion process comparing the gradient / texture similarities of the foreground pixels and the corresponding background is incorporated. These processing steps, while effectively removing cast shadows, also invariably delete some object pixels and distort object shapes. Therefore, a morphology-based conditional region growing algorithm is employed to reconstruct the objects' shapes. This novel approach gives favourable results compared to the current state of the art in shadow / highlight suppression. Figure 2 illustrates an example processing result.
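A simplified per-pixel sketch of the brightness / colour distortion test in the spirit of [10]; the per-channel variance normalisation, the gradient / texture assertion and the conditional morphological reconstruction described above are omitted, and the thresholds are illustrative.

```python
import numpy as np

def detect_shadows_highlights(frame, bg, cd_thresh=10.0, lo=0.5, hi=1.25):
    """Classify foreground pixels as cast shadow / highlight candidates using
    brightness and colour distortion with respect to the background model.

    frame, bg : (H, W, 3) float RGB current frame and background image
    """
    dot = np.sum(frame * bg, axis=2)
    norm2 = np.sum(bg * bg, axis=2) + 1e-6
    alpha = dot / norm2                                           # brightness distortion
    cd = np.linalg.norm(frame - alpha[..., None] * bg, axis=2)    # colour distortion
    chromatic_ok = cd < cd_thresh
    shadow = chromatic_ok & (alpha >= lo) & (alpha < 1.0)         # darker, same chromaticity
    highlight = chromatic_ok & (alpha > 1.0) & (alpha <= hi)      # brighter, same chromaticity
    return shadow, highlight
```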
3
Robust Objects Tracking
After the cast shadows / highlights removal procedure, a classical 8-connectivity connected component analysis is performed to link all the pixels presumably belonging to individual objects into respective blobs. The blobs are temporally tracked throughout their movements within the scene by means of temporal templates.
3.1
Temporal Templates
Each object of interest in the scene is modelled by a temporal template of persistent characteristic features. In the current studies, a set of five significant features is used: the velocity at the blob centroid; the size, i.e. the number of pixels contained; the ratio of the major axis vs. the minor axis of the best-fit ellipse of the blob [11]; the orientation of the major axis of the ellipse; and the dominant colour, represented by the principal eigenvector of the aggregated pixels' colour covariance matrix of the blob. Therefore, at each time step we have, for each object, a template of these features. There are two points that need special clarification: a) Prior to matching the template with a candidate blob in the new frame, Kalman filters are used to update the template by predicting, respectively, its new velocity, size, aspect ratio and orientation. The velocity of the candidate blob is calculated as
b) For the colour component of the template we use the value 1.0, while the colour similarity between the template's dominant colour and that of the candidate blob is used as the corresponding component of the candidate's feature vector. It is only after a match (Section 3.2) is found that the template's dominant colour is replaced with that of the matched candidate. The mean and variance vectors of such a template are updated when a candidate blob in the new frame is found to match it; they are computed using the latest L blobs that the object has matched, i.e. a temporal window of L frames (e.g., L = 50). The individual Kalman filters are updated only with the corresponding feature values of the matched blob.
3.2
Matching Procedure
We choose to use a parallel matching strategy in preference to a serial matching one such as that used in [5]. The main issue now is the use of a proper distance metric that best suits the problem under study. Obviously, some features are more persistent for an object while others may be more susceptible to noise. Also, different features normally assume values in different ranges with different variances. The Euclidean distance does not account for these factors, as it allows dimensions with larger scales and variances to dominate the distance measure. One way to tackle this problem is to use the Mahalanobis distance metric, which takes into account not only the scaling and variance of a feature, but also the variation of other features based on the covariance matrix. Thus, if there are correlated features, their contribution is weighted appropriately. However, with high-dimensional data the covariance matrix can become non-invertible. Furthermore, matrix inversion is a computationally expensive process, not suitable for real-time operation. So, in the current work a scaled Euclidean distance between the template and a candidate blob, shown in (1), is adopted, assuming a diagonal covariance matrix. For a heterogeneous data set, this is a reasonable distance definition.
where the index runs through all the features of the template, and the denominator is the corresponding component of the variance vector. Note especially that for the colour component, the value 1.0 is assumed for the object template and the colour similarity for the candidate blob.
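A sketch of the scaled Euclidean distance in (1) and of a simple parallel matching loop built on it; the gating threshold, occlusion marking and template update steps described elsewhere in the paper are omitted, and the greedy assignment is an assumption rather than the authors' exact procedure.

```python
import numpy as np

def scaled_euclidean_distance(template_mean, template_var, candidate, eps=1e-6):
    """Scaled Euclidean distance (diagonal-covariance Mahalanobis) between a
    template's predicted feature vector and a candidate blob's feature vector."""
    d2 = (template_mean - candidate) ** 2 / (template_var + eps)
    return float(np.sum(d2))

def match_blobs(templates, candidates):
    """Greedy parallel matching: each template takes the nearest unmatched blob.

    templates  : list of (mean, var) feature vectors, e.g. 5-D as in the text
    candidates : list of candidate blob feature vectors
    """
    pairs, used = [], set()
    for ti, (mean, var) in enumerate(templates):
        dists = [(scaled_euclidean_distance(mean, var, c), ci)
                 for ci, c in enumerate(candidates) if ci not in used]
        if dists:
            d, ci = min(dists)
            pairs.append((ti, ci, d))
            used.add(ci)
    return pairs
```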
3.3
Occlusions Handling
In the current approach, no use is made of special heuristics on the areas where objects enter into or exit from the scene. Objects may appear or disappear in the middle of the image and, hence, positional rules are not necessary.
To handle occlusions, the use of heuristics is essential. Every time an object fails to find a match with a candidate blob, a test for occlusion is carried out. If the object's bounding box overlaps with some other object's bounding box, then both objects are marked as 'occluded'. This process is repeated until all objects are either matched, marked as occluded, or removed after being missing for MAX_LOST frames. As discussed before, during a possible occlusion period the object's template of features is updated using the average of the last 50 correct predictions to obtain a long-term tendency prediction. Occluded objects are better tracked using the averaged template predictions; in doing so, small erratic movements in the last few frames are filtered out. Predictions of positions are constrained to lie within the occlusion blob. Once the objects are tracked, the classification challenges can be addressed.
4
Object Classification
The goal is to classify each persistently tracked object as being a single person, a group of people or a vehicle. The procedure employed is based on evaluating the internal motion within the tracked object blob over T consecutive frames, similar to that discussed in [8]. First, translation and scale compensation of the object over time is needed: translation is handled by using a bounding box centred on the tracked object, and the bounding box is then resized to a standard size to compensate for scale variations. Second, the internal motion is computed from the blob changes in consecutive frames using the XOR operator, and these changes are accumulated over the last T frames. Finally, the accumulated changes corresponding to the pixels in the top and bottom sections of the object are added together (2), considering that the only repetitive movements observed for walking persons are in the top (arms) and bottom (legs) sections.
where X and Y are the width and height of the scale-compensated object blob.
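A sketch of this internal-motion measure, assuming the blob silhouettes have already been translation- and scale-compensated; the section boundaries and the normalisation are illustrative assumptions rather than the values used in the paper.

```python
import numpy as np

def internal_motion_score(aligned_masks, top_frac=0.3, bottom_frac=0.3):
    """Accumulate frame-to-frame silhouette changes (XOR) over the last T frames
    and sum them over the top and bottom sections of the normalised blob.

    aligned_masks : (T, Y, X) boolean blob silhouettes, already translation- and
                    scale-compensated to a standard size.
    """
    T, Y, X = aligned_masks.shape
    changes = np.logical_xor(aligned_masks[1:], aligned_masks[:-1])
    motion = changes.sum(axis=0).astype(float)           # per-pixel motion history
    top = motion[: int(top_frac * Y), :].sum()
    bottom = motion[int((1 - bottom_frac) * Y):, :].sum()
    # normalise by area and number of frames so a fixed threshold can be applied
    return (top + bottom) / (X * Y * (T - 1))

# A score above a tuned threshold indicates non-rigid motion (person / group);
# below it, a rigid object such as a vehicle.
```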
Fig. 3. Accumulated internal motion for a group of persons and a car, depicted in grey scale with white values denoting higher motion. The left image shows much higher internal repetitive movement, especially in the upper and bottom sections.
At this point, a threshold can be defined. An object is identified as a non-rigid moving object, such as a person or a group of people, if its value is above the threshold; otherwise it is classified as a vehicle. The choice of threshold depends on T. In our tests a threshold of 1 proved to classify most of the objects correctly when using a value of T = 50 (2 s at 25 fps).
5
Experimental Results
The system has been extensively evaluated in several scenarios and conditions, using, among others, the benchmark video sequences provided by PETS 2001. The original test images are compressed in JPEG format, and we have used subsampled versions of size 384 × 288. Apart from the JPEG compression artefacts, the sequences also contain a few other difficulties, including thin structures, reflections, illumination changes, swaying leaves in trees and window reflections in outdoor scenarios, shadows, etc. The system has dealt with all these problems successfully and handles occlusion situations well when the movement of the blobs is easily predictable, as in Figure 4.
Fig. 4. An example illustrating one difficult tracking situation: a white van is occluded by a thin structure (a street light pole) and a group of people is largely blocked by the van for a few frames. These and other tracking results are accessible to view at URL: http://gps-tsc.upc.es/imatge/_jl/Tracking.html
Problems occur when a few individually moving objects join each other and form a group. These objects are correctly tracked within the limit of the pre-defined MAX_LOST frames, as if they were occluding each other; beyond that limit the system creates a new template for the whole group. Other problems may appear when objects abruptly change their motion trajectories during occlusions: sometimes the system is able to recover the individual objects after the occlusion, but on other occasions new templates are created. Shadows and highlights are handled correctly in most cases, although very long cast shadows may sometimes not be completely removed. Finally, objects are correctly classified in over 80% of the frames, using the majority-voting classification result over a sliding window of W frames, e.g. W = 50.
6
Conclusion
In this paper, we have presented a robust vision-based system for accurate detection, tracking as well as categorical classification of moving objects in outdoor
environments surveyed by a single fixed camera. Each foreground object of interest is segmented, with shadows removed, by an effective framework. The 2D appearance of each detected object blob is described by multiple characteristic cues. This template of features is used, by way of a scaled Euclidean distance matching metric, for robust tracking of the candidate blobs appearing in the new frame. In completing the system we have also introduced technical solutions for false foreground pixel suppression and temporal template adaptation, and have discussed briefly the issues of object classification based on motion history. Experiments have been conducted on real-world scenarios under different weather conditions, and good and consistent performance has been confirmed. Future work includes resolving the difficult problem of individual moving objects joining, separating and joining again by using more persistent appearance modelling, as well as multi-camera cooperative tracking and occlusion handling. Acknowledgments. This work was performed at the Content and Coding Lab, BT Exact, UK, where JLL was supported by a BT Student Internship, in connection with the EU Framework V SCHEMA project (IST-2000-32795). JLL also acknowledges the bursary from Spanish national project grant number TIC20010996.
References
1. Wren, C., Azarbayejani, A., Darrell, T., Pentland, A.: Pfinder: Real-time tracking of the human body. IEEE Transactions on Pattern Analysis and Machine Intelligence (1997)
2. Jabri, S., Duric, Z., Wechsler, H., Rosenfeld, A.: Detection and location of people in video images using adaptive fusion of color and edge information. Proceedings of International Conference on Pattern Recognition (2000)
3. Stauffer, C., Grimson, W.: Learning patterns of activity using real-time tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence (2000)
4. Elgamal, A., Duraiswami, R., Harwood, D., Davis, L.: Background and foreground modeling using nonparametric kernel density estimation for visual surveillance. Proceedings of IEEE (2002)
5. Zhou, Q., Aggraval, J.: Tracking and classifying moving objects from video. Proceedings of Performance Evaluation of Tracking and Surveillance (2001)
6. Horn, K.: Robot Vision. MIT Press (1986)
7. Haritaoglu, I., Harwood, D., Davis, L.: W4: Real time surveillance of people and their activities. IEEE Transactions on Pattern Analysis and Machine Intelligence (2000)
8. Javed, O., Shah, M.: Tracking and object classification for automated surveillance. Proceedings of European Conference on Computer Vision (2002) 343-357
9. McKenna, S., Jabri, S., Duric, Z., Rosenfeld, A., Wechsler, H.: Tracking groups of people. Proceedings of Computer Vision and Image Understanding (2000)
10. Horpraset, T., Harwood, D., Davis, L.: A statistical approach for real-time robust background subtraction and shadow detection. Proceedings of International Conference on Computer Vision (1999)
11. Fitzgibbon, A., Fisher, R.: A buyer's guide of conic fitting. Proceedings of British Machine Vision Conference (1995) 513-522
Detection of Vehicles in a Motorway Environment by Means of Telemetric and Visual Data Sonia Izri, Eric Brassart, Laurent Delahoche, Bruno Marhic, and Arnaud Clérentin IUT d’Amiens, Département informatique Université de Picardie Jules–Verne, Avenue des facultés, 80025 Amiens CEDEX
[email protected], {Eric.Brassart, Laurent.Delahoche, Bruno.Marhic, Arnaud.Clérentin}@u-picardie.fr
Abstract. In this article we propose a multi-sensor solution for detecting vehicles in a motorway environment. The proposed approach contributes to improved road safety through the integration of safety devices within the vehicle. It is supported by an original perception system composed of a laser rangefinder and an omnidirectional vision sensor. We show, on the one hand, the vehicle recognition results obtained for each sensor and, on the other hand, the benefit of combining these two data processing modules for reliable and effective alarm management.
1 Introduction Projects concerning road safety assistance through the integration of so-called 'intelligent' sensors are extremely numerous [CARSENSE, RADARNET, DENSETRAFFIC, EAST-EEA, etc.] and respond to an increasingly perceptible need in our daily lives. This line of work is, moreover, strongly encouraged by car manufacturers, drivers, political circles, the medical profession and associations. Whatever the level of integration of the safety systems, and whatever the functional architecture adopted, the recurring problem remains the quantity of information to process given the dynamics involved. The synchronization of processes, the processing times and the real-time acquisition are all constraints which make the final goal difficult to reach. In this context we favour 'solution sensors' which allow us to obtain a maximum of information, or pre-processed information, in a single acquisition and which can easily be mounted on a vehicle. Omnidirectional vision systems [5], [6] are therefore very interesting because, when mounted on a vehicle, they allow dangers to be detected over 360 degrees in a single acquisition. The integration of such a sensor on a vehicle for danger detection is original, as it has not been done to date. Besides the visual data, the project uses a second exteroceptive perception system, a telemetric laser (rangefinder). The latter provides clusters of points stemming from range measurements, from which we try to identify the objects present in the scene. The solution which we propose in this article is part of a project entitled SAACAM (Systèmes Actif d'Aide à la Conduite pour AMéliorer la sécurité automobile: Active Systems of Driving Assistance for Improvement of Motorcar
Safety), which is carried out within the framework of the DIVA (DIagnostic et Véhicules Avancés: DIagnosis and Advanced Vehicles) regional pole of the Picardy region. The project integrates two essential parts. The first is the detection of situations connected with the road configuration that may possibly create a danger (crossroads, reductions in traffic lanes, speed limits, etc.) by using a SIG (Système d'Informations Géographiques: Geographic Information System) coupled with differential GPS localisation; this relates more to longitudinal detection. The second is the detection of dangers connected to the traffic lanes by analysis of the environment close to the SAACAM vehicle (lateral dangers). In this article, we are interested only in the second type of danger, and we propose an exteroceptive sensing solution based on the telemetric sensor and on omnidirectional vision [5], [6]. The first type of sensor provides distance information on the objects around our vehicle. The second type of sensor is interesting because it allows the detection of nearby dangers over 360° in a single panoramic acquisition. We first detail the parts MPLD1 and MPOI2 (see figure 1), which characterise the processing of the telemetric data and of the omnidirectional images. We present the principles set up for the extraction of vehicle signatures in the telemetric data and of visual cues for the detection of vehicles in the unwrapped 'sub-images'. In the last section, we give the perspectives for the continuation of our work, which are characterised by the modules MDF3 and MFT4 of figure 1.
Fig. 1. General plan of functioning of the exteroceptive sensors
1 Module of Processing of Laser Data (MPLD)
2 Module of Processing of Omnidirectional Images (MPOI)
3 Module of Data Fusion (MDF)
4 Module of Follow-up of Tracks (MFT)
2 Telemetric Data Processing The telemetric data are produced by a SICK LMS 200 laser rangefinder, placed behind the SAACAM vehicle for reasons of safety and legislation. Each acquisition yields a series of points providing a planar 'image' over 180° of the environment behind our vehicle (road structure and following vehicles). The acquisition of these data and their processing are represented, respectively, by the 'laser data' sensor and the MPLD block of figure 1.
2.1 The "Clustering" Our 'clustering' method groups together successive points of the telemetric scan according to a distance criterion: the distance separating a point N from its immediate neighbour N+1 must be lower than a threshold [9], [10]. In the example below (Fig. 2), the frames show what is obtained at the conclusion of this stage. Aligned points are characteristic of forms belonging to objects; when continuity is broken, a new cluster is created.
Fig. 2. Extract of an image with the identification of the characteristic objects
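The clustering criterion can be sketched as follows for a 180° scan; the distance threshold is illustrative and is not the value used in the paper.

```python
import numpy as np

def cluster_scan(ranges, angles, max_gap=0.5):
    """Group consecutive laser points into clusters.

    A new cluster is started whenever the Euclidean distance between point N
    and its immediate neighbour N+1 exceeds max_gap (metres, illustrative).
    Returns a list of (M_i, 2) arrays of Cartesian points.
    """
    pts = np.column_stack([ranges * np.cos(angles), ranges * np.sin(angles)])
    clusters, current = [], [pts[0]]
    for p_prev, p in zip(pts[:-1], pts[1:]):
        if np.linalg.norm(p - p_prev) > max_gap:
            clusters.append(np.array(current))
            current = []
        current.append(p)
    clusters.append(np.array(current))
    return clusters

# 181 beams over 180 degrees, as delivered by a SICK LMS-type scanner
angles = np.deg2rad(np.linspace(0.0, 180.0, 181))
ranges = np.full(181, 20.0)
ranges[60:75] = 8.0          # a nearby object creates two range discontinuities
segments = cluster_scan(ranges, angles)   # -> 3 clusters
```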
2.2
The Segmentation
This segmentation reduces the quantity of information to be processed in the 'clusters' by replacing alignments of points with straight lines. For each 'cluster', we can obtain one or more segments corresponding to more or less complicated objects. The algorithm follows the principle of Duda-Hart [8], which consists of repeatedly grouping together sets of points aligned according to a point-to-supporting-line distance criterion. The algorithm stops when no more points satisfy the distance condition.
Fig. 3. Segmentation of “clusters ”
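A sketch of such a recursive segmentation in the spirit of the Duda-Hart principle (iterative endpoint fit); the stopping threshold is illustrative.

```python
import numpy as np

def point_line_distance(pts, a, b):
    """Perpendicular distance of each point in pts to the line through a and b."""
    d = b - a
    n = np.array([-d[1], d[0]]) / (np.linalg.norm(d) + 1e-9)   # unit normal
    return np.abs((pts - a) @ n)

def split_into_segments(pts, dist_thresh=0.15):
    """Recursively replace a cluster of roughly aligned points by straight
    segments; each segment is returned as a pair of endpoints."""
    if len(pts) < 3:
        return [(pts[0], pts[-1])]
    dists = point_line_distance(pts, pts[0], pts[-1])
    k = int(np.argmax(dists))
    if dists[k] < dist_thresh:
        return [(pts[0], pts[-1])]          # all points close enough to the supporting line
    return (split_into_segments(pts[: k + 1], dist_thresh)
            + split_into_segments(pts[k:], dist_thresh))
```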
2.3
The Filtering
Three types of filtering were set up: (i) elimination of 'small' clusters, according to a criterion inversely proportional to the distance of the cluster from the laser source; (ii) fusion of aligned straight lines, which merges lines one of whose extremities is 'close' to another and whose relative orientation does not exceed an angle between 0 and ±10° (figure 4); (iii) fusion of orthogonal straight lines, where straight lines belonging to consecutive clusters have a 'close' extremity and a relative orientation of 90° ± 5° (figure 5).
Fig. 4. Fusion of two segments
Fig. 5. Orthogonality of two segments belonging to two different clusters
2.4 Vehicles' Recognition After a study of the 2D vehicle signatures in our data, it turns out that vehicles can be identified in three different manners: a single straight line, when the perpendicular to it passes through the emission point of the laser beam; two perpendicular straight lines with their obtuse angle to the left, when the detected vehicle is in the [0°, 90°] sector of the telemetric sensor; two perpendicular straight lines with their obtuse angle to the right, when the detected vehicle is in the [90°, 180°] sector of the telemetric sensor.
3 Omnidirectional Data The acquisition of the images and the omnidirectional data processing are represented in figure 1 by the sensor ‘omnidirectional image’ and the block MPOI.
3.1
Acquisition of Images
The acquisition of the omnidirectional images is achieved through a specific exteroceptive sensor from the ACCOWLE company (http://www.accowle.com/) in Japan. It consists of a convex mirror of spherical type (figure 6), placed on a cylindrical glass support, and of a black needle. This device is mounted on a SONY EVI 330 CCD colour camera [768x576x24 bits]. This sensor provides a view over 360°
of the environment in a single acquisition, making it possible to observe the vehicles evolving around the car SAACAM as well as the ground markings characterising the traffic lanes.
3.2 Image Processing
3.2.1 Application of Masks The image obtained with the spherical sensor is not exploitable in its totality, and masking these image zones allows us to save on processing time. To do this, two masks were applied to each image: an internal mask avoiding the processing of pixels associated with our vehicle SAACAM (roof of the car and the sensor), and an external mask eliminating the external crown of the sensor and the reflections of the glass support.
Fig. 6. Prototype of the omnidirectional spherical sensor of ACCOWLE
Fig. 7. Omnidirectional masked image
This pre-processing reduces the quantity of information to be processed by 40%. The eliminated zones present no interest for the detection of characteristic objects.
3.2.2 Extraction of ‘Sub-images’ Still with the aim of accelerating image processing, we reduce the quantity of information to be processed by searching for characteristic landmarks only in the privileged zones of our omnidirectional image. These zones correspond to:
the section of road just in front of our vehicle, for the detection of the vehicles ahead; the section behind, to allow the tracking of following vehicles; the left rear section, to detect a vehicle commencing a change of lane with the aim of overtaking, or the arrival of a vehicle positioned on the left and in the process of overtaking; and the right-hand front section, during lane changing for the overtaking of the vehicle in front. The extraction of these ‘sub-images’ is made, on the one hand, from four parameters characterising the centre of the image and the minimum and maximum radii of the omnidirectional image (see figure 7), and on the other hand from the starting and finishing angles between which this extraction is made; these correspond respectively to the height and the width of the ‘sub-image’. In order to obtain a more humanly realistic interpretation of these portions of images, we applied a bilinear transformation [5]. The result of the ‘sub-images’ with their bilinear interpolation is given in figure 8. When a vehicle is detected as being overtaken, or when our vehicle is being overtaken, a lateral zone of 180° on the side where the detection is made is automatically extracted from the omnidirectional image to ensure the tracking of one or more vehicles (see figure 9).
Fig. 8. Aspect of the image for the definition of a ‘sub-image’
Fig. 9. Representation of the right hand side of the road scenario following detection of overtaking
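A hypothetical sketch of the polar-to-Cartesian extraction of such a ‘sub-image’ with bilinear interpolation is given below; the parameter names and the output size are our own choices, and the actual implementation based on [5] may differ.

```python
import numpy as np

def extract_subimage(omni, center, r_min, r_max, ang_start, ang_end,
                     height=120, width=240):
    """Unwarp an angular sector of an omnidirectional image.

    The sector is defined by the image centre, the minimum/maximum radii and
    the start/end angles (radians); rows of the output span the radial extent
    and columns the angular extent.  The sector is assumed to lie inside the image.
    """
    out = np.zeros((height, width), dtype=float)
    cy, cx = center
    for i in range(height):
        r = r_min + (r_max - r_min) * i / (height - 1)
        for j in range(width):
            a = ang_start + (ang_end - ang_start) * j / (width - 1)
            y, x = cy + r * np.sin(a), cx + r * np.cos(a)
            y0, x0 = int(np.floor(y)), int(np.floor(x))
            dy, dx = y - y0, x - x0
            # bilinear interpolation between the four surrounding pixels
            out[i, j] = ((1 - dy) * (1 - dx) * omni[y0, x0]
                         + (1 - dy) * dx * omni[y0, x0 + 1]
                         + dy * (1 - dx) * omni[y0 + 1, x0]
                         + dy * dx * omni[y0 + 1, x0 + 1])
    return out
```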
3.3 Detection of Vehicles
3.3.1 Modelling by Active Contours To detect the vehicles present in the resulting ‘sub-image’ we chose to model them by snakes, or active contours. A snake [2] is an elastic curve which can be modelled by a normalised parametric form
v(s) = (x(s), y(s)), \quad s \in [0, 1]
where s is the curvilinear abscissa (the parameter along the curve in the spatial domain), v(s) is the position vector of the contour point of coordinates x(s) and y(s), and v(0) and v(1) are the position vectors of the extremities of the contour. The total energy of the contour, which we try to minimise, is represented by the following functional:
E_{snake} = \int_0^1 \left[ E_{int}(v(s)) + E_{image}(v(s)) + E_{con}(v(s)) \right] ds \qquad (1)
where E_{int} represents the internal energy of the snake, E_{image} is the energy derived from the image (contours, gradients) and E_{con} represents the energy of constraints.
The internal energy E_{int} is intrinsic to the snake; it decomposes into two terms:
E_{int} = \tfrac{1}{2}\left( \alpha(s)\,|v'(s)|^2 + \beta(s)\,|v''(s)|^2 \right)
The first term involves the first derivative of v(s) with respect to s, weighted by \alpha(s), which controls the tension (elasticity) of the contour; the second term involves the second derivative of v(s), weighted by \beta(s), which controls the rigidity of the contour.
The external energy E_{image} depends on characteristics of the image. It is the force which steers the contour towards the desired position in the image. One of the most commonly used forces is the one related to the gradient of the image, defined by:
E_{image} = -\,|\nabla I(v(s))|^2
Energy of constraints: E_{con} is defined by the user and represents high-level constraints considered relevant to increase the precision of the segmentation. 3.3.2 Greedy Algorithm An algorithm for the active contour model using dynamic programming was proposed by Amini et al. [11]. The approach is numerically stable. It also allows the inclusion of hard constraints in addition to the soft constraints inherent in the formulation of the functional. Williams and Shah [3] pointed out some of the problems with these approaches and proposed the greedy algorithm. The algorithm is faster than Amini’s algorithm, being O(nm) for a contour having n points which are allowed to move to any point in a neighbourhood of size m at each iteration. The function used in the algorithm is:
E = \int \left( \alpha(s)\,E_{cont} + \beta(s)\,E_{curv} + \gamma(s)\,E_{image} \right) ds
The form of this equation is similar to Eq. (1). The first term expresses first-order continuity constraints and the second term second-order continuity constraints; together they correspond to the internal energy in Eq. (1). The last term is the same as the image force in Eq. (1). No term for external constraints was included here. In this project, we use an implementation [4] based on the formulation (greedy algorithm) proposed by Williams and Shah; the corresponding pseudo-code is sketched below.
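A rough Python transcription of one greedy iteration is given below. It is our own sketch: the neighbourhood size, the weight values and the omission of the per-neighbourhood normalisation of the energy terms used by Williams and Shah are simplifications, and boundary checks are left out.

```python
import numpy as np

def greedy_snake_step(snake, energy_image, alpha=1.0, beta=1.0, gamma=1.2, win=1):
    """One iteration of a Williams-Shah style greedy snake update.

    snake: (n, 2) integer array of (row, col) contour points (closed contour).
    energy_image: image term (e.g. negative gradient magnitude), lower is better.
    Each point moves to the position of its (2*win+1)^2 neighbourhood that
    minimises the weighted sum of continuity, curvature and image energies.
    """
    n = len(snake)
    mean_dist = np.mean(np.linalg.norm(np.diff(snake, axis=0), axis=1))
    new_snake = snake.copy()
    for i in range(n):
        prev_pt, next_pt = new_snake[i - 1], snake[(i + 1) % n]
        best, best_e = snake[i], np.inf
        for dr in range(-win, win + 1):
            for dc in range(-win, win + 1):
                cand = snake[i] + np.array([dr, dc])
                e_cont = abs(mean_dist - np.linalg.norm(cand - prev_pt))   # first-order term
                e_curv = np.sum((prev_pt - 2 * cand + next_pt) ** 2)       # second-order term
                e_img = energy_image[int(cand[0]), int(cand[1])]           # image force
                e = alpha * e_cont + beta * e_curv + gamma * e_img
                if e < best_e:
                    best, best_e = cand, e
        new_snake[i] = best                    # migrate towards the lowest-energy neighbour
    return new_snake
```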
Once the energies of every point of the neighbourhood are calculated, we decide to which pixel of the image the point of the snake should migrate (towards the neighbourhood point whose sum of the three energies is the lowest), as can be seen in figure 10, which also shows how the positions of the points within the neighbourhood are coded.
Fig. 10. Possible movements of a point in its neighbourhood
3.3.3 Experimental Results The results are shown in figure 11 together with the reference image. The left part corresponds to the search for the snake initialisation zones. This process takes 40 ms on average and is applied every four images. The right images of figure 11 show the final states of the snake iterations in the 2D images. The iterations are combined on the three colour planes (RGB) and this process takes 210 ms on average for 500 iterations.
4 Conclusion and Perspectives
We have presented a method allowing us to identify vehicles in real time from visual and telemetric information. The results obtained are relatively stable. The primitives extracted from the telemetric and visual observations are largely complementary, but also, to a lesser degree, redundant. These characteristics are very interesting because they guarantee a high level of reliability when they are merged. This is precisely the next step of our study. It will rely on the concepts of association and combination of symbolic information. The use of Dempster-Shafer theory [12] will be favoured in this framework because it is very well adapted. This formalism will allow us to manage and to propagate the notion of uncertainty through the entire processing sequence, whose final stage will be the estimation of the state of the nearby vehicles. We have already been developing innovative concepts with regard to this problem for several years [6] [7] and we wish to adapt them and apply them to the safety system presented in this article.
Fig. 11. Results of a detection of the surrounding vehicles
References
1. Laurent Cohen, “Modèles déformables”, CEREMADE, URA CNRS 749, Université Paris 9 Dauphine.
2. M. Kass, A. Witkin and D. Terzopoulos, “Snakes: Active contour models”, Proc. 1st Int. Conference on Computer Vision, London, 1987, pp. 259-268.
3. Donna J. Williams and Mubarak Shah, “A Fast Algorithm For Active Contours and Curvature Estimation”, Image Understanding, Vol. 55, No. 1, January 1992, pp. 14-26.
4. Lavanya Viswanathan, “Equations For Active Contours”, November 1998.
5. Cyril Cauchois, “Modélisation et Calibration du Capteur Omnidirectionnel SYCLOP: Application à la Localisation Absolue en Milieu Structuré”, Université de Picardie Jules Verne, December 2001.
6. Arnaud Clérentin, “Localisation d’un robot mobile par coopération multi-capteurs et suivi multi-cibles”, Université de Picardie Jules Verne, December 2001.
7. Arnaud Clérentin, Laurent Delahoche, Eric Brassart, Cyril Cauchois, “Mobile robot localization based on multi target tracking”, Proc. of the IEEE International Conference on Robotics and Automation (ICRA 2002), Washington, USA, May 2002.
8. J. Crowley, “World modelling and position estimation for a mobile robot using ultrasonic ranging”, Proc. of IEEE Conf. on Robotics and Automation, Scottsdale, May 1989, pp. 674-680.
9. Sonia Izri, Eric Brassart, Laurent Delahoche, “Détection d’Objets dans des Images Omnidirectionnelles: Application en Milieu Autoroutier”, CNRIUT, Tarbes, 15-16 May 2003.
10. Sonia Izri, Eric Brassart, Laurent Delahoche, Arnaud Clérentin, “Détection de Véhicules dans un Environnement Autoroutier à l’aide de Données Télémétriques et Visuelles”, Majecstic, Marseille, 29-31 October 2003.
11. A. A. Amini, T. E. Weymouth, and R. C. Jain, “Using Dynamic Programming for Solving Variational Problems in Vision”, IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 12, no. 9, September 1990.
12. G. A. Shafer, “A Mathematical Theory of Evidence”, Princeton University Press, 1976.
High Quality-Speed Dilemma: A Comparison Between Segmentation Methods for Traffic Monitoring Applications
Alessandro Bevilacqua2, Luigi Di Stefano1,2, and Alessandro Lanza2
1 Department of Electronics, Computer Science and Systems (DEIS), University of Bologna, Viale Risorgimento 2, 40136 Bologna, Italy
2 Advanced Research Center on Electronic Systems ‘Ercole De Castro’ (ARCES), University of Bologna, Via Toffano 2/2, 40125 Bologna, Italy
{abevilacqua, ldistefano, alanza}@arces.unibo.it
Abstract. The purpose of traffic monitoring applications is to identify (and track) moving targets. The system performance strongly depends on the effectiveness of the segmentation step. After thresholding the image resulting from the difference between a reference background and the current frame, some morphological operations are required in order to remove noisy pixels and to group correctly detected moving pixels into moving regions. Mathematical and statistical morphology allow the development of commonly used morphological techniques which meet the real-time requirements of the applications. However, when developing segmentation methods the right trade-off between quality and time performance should be taken into consideration. This work offers some guidelines which help researchers to choose between different segmentation techniques characterized by a higher quality and a higher time performance, respectively. An extensive experimental section dealing with indoor and outdoor sequences assesses the reliability of this comparison.
1 Introduction
To have correctly segmented moving blobs (a sort of coherent connected regions sharing common features) represents a key issue in all traffic monitoring systems. In fact, a weak segmentation step could affect the subsequent stages of feature extraction and tracking. Therefore, choosing the proper segmentation method represents a crucial task. When designing and developing traffic monitoring systems, the image processing operations employed should be the most effective among those which meet soft real-time requirements. However, finding a good trade-off between time performance and quality of results is a challenging task which afflicts all system designers. This paper considers two different segmentation methods whose performances are outlined in terms of quality of the attained result versus the speed needed to obtain it. The first method consists of the approach we have devised ([1]). The novel morphological operation we set up takes advantage of all the true signals which come from a previous low-threshold operation without being heavily afflicted by
the inevitable huge amount of noise. The purpose of this operator is to detect connected objects on the basis of the criterion that a given structure must fit inside the object. In particular, the decision to preserve a “structured” component is based on a measurement criterion which we called “the fitness” of the operator. By fixing some of the parameters of this method, a high-speed segmentation algorithm is attained which pays for its efficiency in terms of quality. The second method derives from the widely utilized dilation and erosion operators originating from mathematical morphology. Although this method is slower to perform, it usually attains a higher quality of the border of the detected moving regions, thus achieving more precise geometric and appearance-based properties of the detected objects. It is worth noticing that failure to preserve the objects’ properties across frames could introduce a great amount of mismatches during the tracking phase. The segmentation techniques we are comparing in this paper have been used within the traffic monitoring system we have developed. In order to assess both the quality and the time performance of the two algorithms, different sequences are processed using the overall system. This paper is organized as follows. In Sect. 2 we review some segmentation methods used within a few visual surveillance systems. In Sect. 3 we outline the overall motion detection system we have developed. In Sect. 4 a detailed description of the segmentation methods we are comparing is given. Experimental results are shown in Sect. 5 and Sect. 6 draws conclusions.
2 Previous Works
The system described in [2] works on a binary image stemming from a thresholding operation on a background difference image. First, one iteration of erosion is applied to remove one-pixel noise. Then a fast binary connected component operation allows small regions to be removed and likely foreground regions to be found, which are further enclosed by bounding boxes. In order to restore the original size of the objects, a morphological opening is applied. After reapplying a background subtraction and a size thresholding operation, a morphological closing is performed only on those regions which are enclosed by the bounding boxes. The authors met great difficulties in finding the right combination of morphological operations, which made the system a rather scene-dependent application. In [3,4] the authors use three-frame differencing until a background model is stabilized. After that, the background subtraction technique is used and a thresholding operation permits moving pixels to be obtained. In all cases, moving pixels are grouped by means of a connected component approach. Two iterations of morphological dilation and one erosion step are performed in order to reconstruct incomplete targets. Noise is removed by filtering on the size of the pixels’ area. As a result, blobs are extracted with a rough definition. The research work described in [5] proposes an efficient solution that considers only the pixels of the edge of the binary objects, and then moves along this contour pixel by pixel, writing on the output image only those pixels of the SE which could not be reached at the preceding move (the non-overlapping pixels), or the entire SE at the beginning. This allows the coding of an efficient dilation. The erosion is obtained by dilating the complementary image. The authors
Fig. 1. The general scheme for the motion detection algorithm (a), a sample frame (b) and the thresholded background subtraction (c)
in [6] present an efficient algorithm well-suited for basic morphological operations with large arbitrary shaped SE’s. It exploits an original method based on histograms to compute the minimum (or the maximum) value of the image in the window defined by the structuring element in case of erosion (or dilation, respectively). However, this method improves the performance of the classical operations mostly in case of grey level images. The morphological segmentation presented in [7] is based on a new connected operator which uses morphological gray scale reconstruction to preserve the number of regions. Basically, the authors devised a new method which exploits both size and contrast in order to preserve small objects with a high contrast and to keep low contrasted regions separated, respectively. Although this method shows a high efficiency for gray level images it is useless for binary images.
3 The Overall Motion Detection System
The segmentation step we describe refers to the traffic monitoring system we have developed. It relies on a stationary video camera or, at most, on a camera moving in a “step and stare” mode. The algorithm processes one frame at a time and gives the segmented interesting moving blobs as the final output, which here are made of vehicles, humans, shadows or all of them. The outline of the overall motion detection algorithm is described in Fig. 1(a). After the system has generated a background through a bootstrap procedure ([8]) and has performed the arithmetic subtraction between the generated background and the current frame (a sample is shown in Fig. 1(b)), a suitable threshold has to be chosen and applied in order to detect moving pixels. The output of this step, called background subtraction, consists of a noisy binary image (Fig. 1(c))
Fig. 2. Structuring elements: basic (a), compound (b) and cell-based (c)
which retains most of the true moving pixels together with false signals due to noise, moving shadows and uninteresting moving objects, such as hedges or trees. These signals must be removed and the shape of interesting moving objects must be “extracted”. Removing these signals has been often called in the image processing community the False Positive Reduction (FPR) step. After that, pixels which survived the previous step are grouped into connected regions. As a matter of fact, these two steps are commonly performed during one scan of the input image. In the next Section we compare two different methods which can be used in order to obtain connected components stemming from thresholded background differences.
4 The Segmentation Methods
The first method taken into consideration has been described in [1]. It aims to give a measure of how much a pixel belongs to a structural windowed region around it, thus resulting in a very effective FPR step. In fact, blobs fitting a given compound structure can be “reconstructed” while noise is removed, since it does not fit the same structure. Fig. 2 shows the basic structure (a) and the compound structure (b) we use. The latter is obtained by rotating the former by 90°, 180° and 270°; that is, the basic structure is searched for in every spatial arrangement. In addition to these two structures, we define a cell-based structure (Fig. 2(c)). It is built starting from the compound structure (b) in the same way as (b) is built starting from (a). But (b) is symmetric; thus (c) is formed basically by the set of all possible occurrences of the compound structure. Namely, in the example of Fig. 2 the cell-based element (c) is composed of 9 compound (cell) elements (b), whose centers are the white circles plus the black circle. How does this method work exactly? In our implementation, all the pixels of the elements involved in (a), (b) or (c) are assigned “1”. In the case of the basic structure (Fig. 2(a)), a logical AND between the pixel pointed to by the circle and each one of its three neighbours is performed. The arithmetic sum of these three partial results represents the fitness of the pixel pointed to by the circle (therefore, the maximum fitness value is 3). Further, a hard threshold on this fitness value allows the pixel to be assigned “1” or “0”, depending on whether the fitness is greater or less than the threshold. In the case of the compound structure (Fig. 2(b)), this procedure is accomplished four times, one for each possible position of the basic element (a) within the compound element (b). Unlike before, the partial fitnesses computed for the pixels pointed to by the white circles are summed to each other instead of being assigned to the pixel. Here, the maximum fitness value can go up
Fig. 3. The framework of the compared segmentation methods: (a) structural fitness (SF), (b) classical morphological (CM)
to 3×4 = 12, in case all the underlying image pixels hold “1”. The outcome of the threshold operation performed on the total fitness is finally assigned to the pixel corresponding to the center of the structure (the black circle). At last, for the cell-based structure (Fig. 2(c)), first we compute the fitness for each cell and then the overall fitness is assigned again to the central pixel pointed to by the black circle. Finally, the operator “switches on” pixels having a fitness greater than a fixed threshold and “switches off” the ones characterized by a lower fitness. The second method exploits the classical opening and closing morphological operations. Practically speaking, there are several ways to perform morphological opening and closing operations. For example, the order in which they are performed is relevant, as is the preprocessing step one considers. Moreover, the threshold represents the most sensitive parameter for this method too, as expected. Finally, both the mask size and the threshold of the dilate and erode operators are crucial in order to achieve good results. Fig. 3 shows a scheme for the first (a) and the second (b) method, where the “size filtering” represents the area-open operation.
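The fitness computation for the compound element can be sketched as follows. This is an illustrative reading of the description above: the geometry assumed for the basic element (centre pixel plus three neighbours) and the threshold value are assumptions, since the exact element of Fig. 2 is not reproduced here.

```python
import numpy as np

# Offsets of the three neighbours of the basic element (assumed geometry).
BASIC = [(0, 1), (1, 0), (1, 1)]

def rotations(offsets):
    """The basic element rotated by 0, 90, 180 and 270 degrees."""
    rots, cur = [], offsets
    for _ in range(4):
        rots.append(cur)
        cur = [(-c, r) for r, c in cur]        # 90-degree rotation of the offsets
    return rots

def compound_fitness(binary, r, c):
    """Fitness of pixel (r, c): AND the centre pixel with each neighbour of every
    rotated basic element and sum the partial results (maximum 3 x 4 = 12)."""
    fit = 0
    for offsets in rotations(BASIC):
        for dr, dc in offsets:
            rr, cc = r + dr, c + dc
            if 0 <= rr < binary.shape[0] and 0 <= cc < binary.shape[1]:
                fit += int(bool(binary[r, c]) and bool(binary[rr, cc]))
    return fit

def structural_fitness_filter(binary, threshold=6):
    """'Switch on' only pixels whose compound fitness exceeds the threshold."""
    out = np.zeros_like(binary, dtype=bool)
    for r, c in zip(*np.nonzero(binary)):
        out[r, c] = compound_fitness(binary, r, c) > threshold
    return out
```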
5 Experimental Results
The overall motion detection algorithm has been written in C and works under Windows and Unix-based OSs. The target PC is an AMD Athlon MP 1800+ with 1 GB RAM. In order to give a reliable comparison of the two segmentation methods, we analyze five different sequences, coming from both outdoor and indoor environments. In particular, the indoor scenes show natural as well as artificial illumination, while the outdoor sequences are taken from a diurnal cluttered traffic scene and show varying illumination conditions. The main purpose of these experiments is to give a researcher some guidelines in order to choose between two different segmentation methods which emphasize speed performance and quality of results, respectively. In the left column of Fig. 4 we show the input binary images of the segmentation module, referring to five representative frames of the corresponding sequences. The images have been obtained after the background subtraction by varying the threshold. We have chosen to fix a low value for this threshold in order to retain most of the true signal, leaving to
Fig. 4. Input binary images (left column) and output of the two compared methods: Structural Fitness (middle column) and Classical Morphological (right column)
the segmentation module the task of removing the false signal. In regard to the first segmentation method, just the basic element shown in Fig. 2(b) has been considered throughout all the experiments. Actually, even though different structural elements have been tested, that basic element shows general-purpose properties. Fundamentally, once the basic SE has been defined, three more parameters need to be set: the size of the cell-based element, the threshold for the fitness, and the threshold for the size filtering operation. The first parameter is strictly related to the threshold applied to the background difference operation: practically speaking, the size of the cell-based element determines the minimum value of that threshold for which the possible false blobs detected are not comparable in size with the smallest true blobs we want the system to reveal. The mask size has been fixed at 3×3 in order to achieve a better time performance. The last parameter to tune is related to the size filtering operation, which aims to remove false blobs whose area is below a prefixed threshold. As for the second method, the experiments have been accomplished by varying five parameters: the threshold for the size filtering operation in the preprocessing step, the order in which the opening and the closing operations are performed, the thresholds of the basic dilate and erode operations, and the same size-filtering threshold tuned at the end of the first method. Actually, choosing quite a relaxed value for the preprocessing threshold enables us to always perform the opening operations after the closing one. Therefore, the two methods share two thresholds, as shown in Table 1 (common). Table 1 also shows that even though the fitness threshold can range from 0 to 12, three subsequent values are enough to deal with five different sequences. On the other side, two different pairs of values for the remaining thresholds are enough in order to achieve good quality results. These thresholds aim to remove false blobs and uninteresting moving objects, and they are defined on the basis of the sensitivity a researcher wants for the system. Table 1 outlines that we have a lower threshold for the indoor sequences than for the outdoor ones. This choice relies on the fact that indoor sequences are characterized by a minor amount of noise. After having tuned the parameters of both methods for five different sequences, we can state that this task is equally easy in both cases. From Table 2 it is possible to notice that the time required by the classical morphological method is roughly twice the one required by the structural fitness method. This corresponds to an improvement in terms of frame rate achieved by the overall system of more than 10%. It is worth noticing that in the case of the last sequence the execution time is greater
in both segmentation methods. In this case the decrease in performance is mainly due to the presence in the sequence of blobs characterized by a larger area (Fig. 4(m,n,o)) than the ones in the other sequences. In order to achieve such a performance in terms of speed, the first method pays a price in terms of quality, as seen in the middle and right columns of Fig. 4. In fact, the classical morphological method yields blobs having smooth borders, while the other method attains a much more jagged border. Even though for the objects we have analyzed this might not be a problem, having jagged borders leads to more imprecise measures for features such as perimeter and compactness. This imprecision could cause problems for the subsequent tracking operations.
6 Conclusions
In this work two segmentation methods utilized within the traffic monitoring application we have developed are compared. The first method we consider is the one we have devised. It allows to attain a high time performance paying for it in terms of quality of moving objects segmentation. On the contrary, the second segmentation algorithm exploits the well known morphological closing and opening operations thus allowing to attain higher quality segmented objects but achieving a lower frame rate. This work gives some guidelines which help a researcher to find a good tradeoff between the high quality of the segmented objects and the processing time utilized in order to attain such a result.
References 1. Bevilacqua, A.: Effective object segmentation in a traffic monitoring application. In: Proc. ICVGIP’02. (2002) 125–130 2. Haritaoglu, I., Harwood, D., Davis, L.S.: W4: Who? When? Where? What? a Real Time System for Detecting and Tracking People. In: Proc. FG’98. (1998) 222–227 3. Collins, R.T., Lipton, A.J., Kanade, T.: A System for Video Surveillance and Monitoring. In: Proc. Topical Meeting on Robotics and Remote Systems. (1999) 497–501 4. Kanade, T., Collins, R., Lipton, A.: Advances in Cooperative Multi-Sensor Video Surveillance. In: Proc. Darpa Image Understanding Workshop. (1998) 3–24 5. Vincent, L.: Morphological transformations of binary images with arbitrary structuring elements. Signal Processing 22 (1991) 3–23 6. Droogenbroeck, M.V.: Fast computation of morphological operations with arbitrary structuring elements. Pattern Recognition Letters 17 (1996) 1451–1460 7. Moon, Y.S., Kim, T.H.: Efficient morphological segmentation using a new connected operator. Electronics letters 36 (2000) 22–24 8. Bevilacqua, A.: A novel background initialization method in visual surveillance. In: Proc. MVA’02. (2002) 614–617
Automatic Recognition of Impact Craters on the Surface of Mars
Teresa Barata1, E. Ivo Alves2, José Saraiva1, and Pedro Pina1
1 CVRM / Centro de Geo-Sistemas, Instituto Superior Técnico, Av. Rovisco Pais, 1049-001 Lisboa, Portugal
{tbarata, jsaraiva, ppina}@alfa.ist.utl.pt
2 Centro de Geofísica da Universidade de Coimbra, Av. Dias da Silva, 3000-134 Coimbra, Portugal
[email protected]
Abstract. This paper presents a methodology to automatically recognise impact craters on the surface of Mars. It consists of three main phases: in the first one the images are segmented through a PCA of statistical texture measures, followed by the enhancement of the selected contours; in a second phase craters are recognised through a template matching approach; in a third phase the rims of the plotted craters are locally fitted through the watershed transform.
1 Introduction Mars is currently the target of intensive exploration programs, with no less than three probes in orbit, and more to come in the future. Craters stand out visually among the features on any planetary surface, but their true importance stems from the kind of information that a detailed analysis of their number and morphology can bring forth. Evaluating the density of craters on different areas of the planet has led to the establishment of a large-scale stratigraphy for Mars [1], a matter still under refinement as coverage of the entire surface with better resolving power continues to become available. The study of craters can also improve our knowledge of the cratering mechanism itself, as well as of the characteristics of the materials targeted; furthermore, we can search for clues about the exogenous processes which cause the degradation of craters (with ejecta removal, ruin of walls and filling of floors) and play such an important role in defining the present character of the surface of Mars. Thus, it is only to be expected that craters are among the most studied of subjects when it comes to analysing planetary surfaces, and that the question of their automatic identification in images has been tackled in several studies. In many instances methods from one field are combined with others, in the search for the best answer. Though not an automated procedure, a word should be said about a NASA project, known as clickworkers, where laypeople were asked on the internet to mark the location of craters on grayscale images of the martian surface [2].
A voting technique, the Hough transform [3], plays an important part in several studies [4, 5, 6, 7, 8, 9, 10], though it has some computational complexity and has shown in some instances a lack of robustness. Template matching is at the core of other studies [11, 12]. Another approach relies on genetic algorithms to construct feature extraction algorithms specific to several types of superficial structures [13, 14]. Another study uses several different techniques simultaneously to obtain the detection of craters with different characteristics [15]. We believe there is room for another approach, such as the one we present here, and apply to optical images from the Mars Orbiter Camera (MOC) aboard the Mars Global Surveyor (MGS) probe. In it texture analysis plays an important role, aided by some mathematical morphology techniques in crucial steps of the methodology (contour filtering and local fitting). To actually detect the craters we employ a shape recognition algorithm to which is fed a pre-processed (segmented) image as free of noise as possible. This algorithm then produces an accurate result consisting of the location of the center of the crater and of its dimension (radius).
2 Image Segmentation Impact craters are characterised by a generally circular shape, with a wide variation of contrast to the background. Fig. 1 is a small sample of the variety of martian cratered terrains. These images were obtained by the MOC (wide angle) of the MGS probe, from the Hellas and Thaumasia regions, located in the southern martian hemisphere, with an average spatial resolution of 240 metres per pixel. Looking at Fig. 1 we can see that there are well defined, fresh craters, whose edges are regular and well contrasted (Fig. 1b), but there are also other craters, larger and/or older, with more irregular edges, not so well contrasted (Figs. 1a and 1c).
Fig. 1. Images from the martian regions of Thaumasia and Hellas: (a) E1900514; (b) R0200575; (c) R0200962. NASA/JPL/MSSS
In order to automatically detect the craters it is first necessary to identify and delineate them, or to segment the image. Since the images we are using are all of a single spectral band for each geographical region (the red band of the wide-angle camera), one of the ways to segment the images is to use their texture, that is, to use texture measures [16, 17, 18]. A class of common texture measures, which we chose for this work, is based on the number of occurrences of each grey level within the processing window (3x3 pixels).
The following measures are obtained: mean, range, variance, entropy and skew. Fig. 2 shows the texture measures obtained for the image of Fig. 1a.
Fig. 2. Texture measures obtained for the image of Fig 1a: (a) Mean; (b) Range; (c) Variance; (d) Entropy; (e) Skew.
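A possible implementation of these windowed first-order measures, using SciPy's sliding-window filters, is sketched below; the exact definitions of entropy and skew used in the paper may differ from the ones assumed here.

```python
import numpy as np
from scipy import ndimage

def window_texture_measures(image, size=3):
    """First-order texture measures in a sliding window (here 3x3 pixels).

    Returns mean, range, variance, entropy and skew images.
    """
    img = image.astype(float)
    mean = ndimage.uniform_filter(img, size)
    rng = ndimage.maximum_filter(img, size) - ndimage.minimum_filter(img, size)
    var = ndimage.uniform_filter(img ** 2, size) - mean ** 2

    def local_entropy(values):
        # occurrence-based entropy of the grey levels in the window
        _, counts = np.unique(values, return_counts=True)
        p = counts / counts.sum()
        return -np.sum(p * np.log2(p))

    def local_skew(values):
        s = values.std()
        return 0.0 if s == 0 else np.mean(((values - values.mean()) / s) ** 3)

    entropy = ndimage.generic_filter(img, local_entropy, size)
    skew = ndimage.generic_filter(img, local_skew, size)
    return mean, rng, var, entropy, skew
```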
Among the measures obtained, the images of range and variance are those that best show the edges of the craters. Image segmentation could be applied directly to one of these images, but the results would contain too many small structures not corresponding to craters. To obtain less noisy binary images a Principal Components Analysis (PCA) was performed with the five texture measures. The results obtained are shown in the images of Fig. 3 for the first two axes, which retain 95% of the initial data.
Fig. 3. (a) First axis of PCA; (b) Second axis of PCA; (c) Thresholding of (a).
The best image for the identification of craters is the one which represents the first axis of the PCA, since even the areas of the craters that were not so easily identifiable in the original image are clearly marked. Analysis of gray levels for this image reveals that the edges of the craters are characterized by high digital values. Therefore, the thresholding of the image leads to a binary image or mask, in which the black dots correspond to the edges of the craters, though some other unwanted structures still remain (Fig. 3c). That is why there is still the need to filter the image, as shown in Fig. 4.
Fig. 4. Crater contour filtering and enhancement: (a) Closing; (b) Erosion; (c) Reconstruction.
A simple morphological erosion-reconstruction sequence is applied [19]. The use of a closing φ with a structuring element B of a given size on the binary image X permits the reinforcement of clusters (Fig. 4a). The application of an erosion ε with a structuring element B of a given size to the previously closed image leads to the suppression of small and isolated regions (Fig. 4b). The geometry of the remaining regions is recovered by a reconstruction operation R (Fig. 4c). The complete sequence, where Y is the final result, is the following:
Y = R_{\varphi_{\lambda_1 B}(X)}\big[\varepsilon_{\lambda_2 B}\big(\varphi_{\lambda_1 B}(X)\big)\big]
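The closing-erosion-reconstruction sequence could be implemented as follows (a sketch using SciPy and scikit-image; the structuring-element sizes are illustrative, as the paper does not state the values used).

```python
import numpy as np
from scipy import ndimage
from skimage.morphology import reconstruction

def filter_crater_contours(mask, close_size=3, erode_size=3):
    """Closing, erosion and geodesic reconstruction of the thresholded mask.

    mask: boolean image where True marks candidate crater-edge pixels.
    The structuring-element sizes are assumptions, not values from the paper.
    """
    se_close = np.ones((close_size, close_size), dtype=bool)
    se_erode = np.ones((erode_size, erode_size), dtype=bool)
    closed = ndimage.binary_closing(mask, structure=se_close)    # reinforce clusters (Fig. 4a)
    eroded = ndimage.binary_erosion(closed, structure=se_erode)  # drop small regions (Fig. 4b)
    # recover the geometry of the surviving regions (Fig. 4c)
    rebuilt = reconstruction(eroded.astype(np.uint8),
                             closed.astype(np.uint8), method='dilation')
    return rebuilt.astype(bool)
```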
3 Crater Detection Two steps constitute this phase: a global template matching followed by a local fitting of detected contours on the real rims of the craters.
3.1 Global Matching This is made through the application of a crater recognition algorithm. Its current simple status was conceived in order to have a low sensitivity to noise, not to be too demanding on memory resources, and to be suited to fast implementations (in view of creating a base for autonomous landing guidance systems on very remote targets) [12].
This algorithm is applied to the pre-processed binary images resulting from the previous segmentation phase. It searches the whole image for possible centres of craters, by counting the black pixels that are at a given distance away from each point considered and accepting it as a centre only if the number of those pixels is above a certain threshold a; the algorithm does this for diminishing values of the radius, until a minimum value is reached, under which the number of false detections would quickly rise. The maximum search radius was visually chosen on the set of images available, in order to be slightly wider than the largest crater found. If one wanted to fully automate the procedure it could be fixed to be equal to the length of the smallest side of the picture. An example of the application of this algorithm to the images of Fig. 1 (after segmentation) is presented in Fig. 5. In case a, 7 craters out of 8 were correctly recognized (87.50% of success), and we had 1 false recognition (corresponding to 12.50% of plotted craters). In case b, 7 true detections out of 10 craters present in the original image (70.00% of success), and 2 false positive results (22.22% of plotted craters). In case c, 5 true detections in 7 craters (71.43%) and 2 false craters detected (28.57%).
Fig. 5. Plotting of detected crater contours for the images of Fig. 1 (minimal radius 8, a equal to a quarter of the perimeter of the corresponding circumference).
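A naive transcription of this search is sketched below; it follows the thresholds quoted in the caption of Fig. 5 (minimum radius 8, acceptance threshold a equal to a quarter of the circle perimeter), but the brute-force scan over all candidate centres is only illustrative and far from an optimised implementation.

```python
import numpy as np

def detect_craters(edge_mask, r_max, r_min=8, fraction=0.25):
    """Scan a binary edge mask for circle centres.

    A point is accepted as a crater centre for radius r when the number of
    edge pixels lying on the circle of radius r around it exceeds the
    threshold a = fraction * perimeter.  Radii are tried in decreasing order.
    """
    detections = []
    h, w = edge_mask.shape
    for r in range(r_max, r_min - 1, -1):
        angles = np.linspace(0.0, 2 * np.pi, int(2 * np.pi * r), endpoint=False)
        dy = np.round(r * np.sin(angles)).astype(int)
        dx = np.round(r * np.cos(angles)).astype(int)
        threshold = fraction * len(angles)
        for cy in range(r, h - r):
            for cx in range(r, w - r):
                hits = edge_mask[cy + dy, cx + dx].sum()   # edge pixels on the circle
                if hits > threshold:
                    detections.append((cy, cx, r))
    return detections
```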
3.2 Local Fitting The contours of the craters resulting from the previous global matching are perfect circumferences, simply plotted by indicating a centre and a radius. In general, these contours do not exactly follow locally the real rims of the craters. While this aspect is not significant for the global counting of craters and the related size distributions used to determine relative terrain ages, it becomes an important issue when the evaluation of local geometric features of the craters is the objective. In order to have contours that follow the real rims of the craters we have developed an approach that uses the watershed transform [19]. It consists in computing the watershed using adequate markers (a binary mask) to locally constrain the use of this segmentation operator. Our approach is illustrated with the image f of Figs. 6a and 6b. The binary mask is constructed from the union of the contour of the dilation of the circles X (hole-filling of the circumferences, Fig. 6c) and from the erosion
of the same circles X. The sequence for the construction of the mask M, presented in Fig. 6d, can be synthesised as M = ∂[δ_B(X)] ∪ ε_B(X), where X is the hole-filled set of circles (Fig. 6c). The application of the watershed WS on the morphological gradient grad(f) of the initial image f (Fig. 6e), imposing the marker set M, permits the creation of contours that follow the real rims of the craters: Y = WS_M(grad(f)). The differences occur naturally at a local level and can be verified by comparing Figs. 6b and 6f.
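A rough sketch of this marker-constrained watershed, using scikit-image, is given below; the dilation/erosion sizes are assumptions, and the Sobel gradient is used as a stand-in for the morphological gradient.

```python
import numpy as np
from scipy import ndimage
from skimage.segmentation import watershed
from skimage.filters import sobel

def fit_crater_rims(image, circles):
    """Marker-constrained watershed to fit plotted circles to the real rims.

    circles: binary image of the plotted circumferences.  The marker set is
    the union of the contour of the dilated (hole-filled) circles and the
    erosion of the same circles, as described in the text.
    """
    filled = ndimage.binary_fill_holes(circles)
    dilated = ndimage.binary_dilation(filled, iterations=3)
    outer = dilated & ~ndimage.binary_erosion(dilated)      # contour of the dilation
    inner = ndimage.binary_erosion(filled, iterations=3)    # erosion of the circles
    markers, _ = ndimage.label(outer | inner)
    gradient = sobel(image.astype(float))                   # stand-in for grad(f)
    return watershed(gradient, markers)
```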
Fig. 6. Local fitting: (a) Initial image (detail of R0200962); (b) Detected contours; (c) Hole-filling; (d) Binary mask; (e) Morphological gradient; (f) Watershed lines.
4 Discussion and Future Work
We have applied our approach to a set of 26 images captured by the Mars Orbiter Camera aboard the Mars Global Surveyor probe during the mapping phase. These images cover approximately and were selected from different regions of the planet in order to cover the range of cratered terrains present on its surface. The global results can be seen in Table 1.
On average, 64.77% of the 264 craters with diameter larger than 2 km present on the 26 images were correctly detected, a result that can be regarded as very satisfactory, considering the differing characteristics of the areas under study. In addition, Fig. 7 illustrates the fact that there is no direct relation between the rate of success (detection of real craters) and the number of false detections (nonexistent craters which are plotted): the image on Fig. 7a has a below average success rate (58.33%) and a small number of false positives (36.36%), as seen by a comparison with Fig.7b, which shows all the plotted craters for this image; Fig. 7c has the best success rate of all 26 images (90.91%), and also a small number of plotted nonexistent craters (33.33%, 5 in 15 plotted), as comparison with Fig. 7d shows.
Fig. 7. (a) Image R0200830; (b) Plotted craters for this image; (c) Image R0200837; (d) Plotted craters for this image. NASA/JPL/MSSS
One point that must be stressed is that the set of parameters involved in the application of the methodology remained fixed for all the 26 images processed. This is an unavoidable requirement when seeking to develop a truly automated procedure, where the human factor is absent. We feel nonetheless that there is much room for improvement in all phases of the methodology. Small modifications to the pre-processing phase can produce better images for the recognition algorithm, free of all sources of noise which greatly contribute for the plotting of non-existent craters. Likewise, the recognition algorithm can be improved in order to enhance accuracy in the location of craters. In the local fitting phase, we aim in the future for a better characterization of the true shape of a crater and its relations with age and exogenous processes. Meanwhile, the European probe Mars Express has started to collect images of Mars with better spatial and spectral resolution. We plan to use those data to consolidate and improve on our approach for the recognition of impact craters, and to pursue other objectives, such as the analysis not only of the crater, but also of the ejecta around it.
Acknowledgements. This paper results from research developed in the frame of the research project PDCTE/CTA/49724/03, funded by FCT (Portugal).
References
1. Hartmann, W., Neukum, G.: Cratering Chronology and the Evolution of Mars. Space Science Reviews, 96 (2001) 165-194
2. Kanefsky, B., Barlow, N., Gulick, V.: Can Distributed Volunteers Accomplish Massive Data Analysis Tasks? Lunar and Planetary Science XXXII (2001) 1272
3. Illingworth, J., Kittler, J.: A Survey of the Hough Transform. Computer Vision, Graphics and Image Processing, 44 (1988) 87-116
4. Homma, K., Yamamoto, H., Isobe, T., Matsushima, K., Ohkubo, J.: Parallel Processing for Crater Recognition. Lunar and Planetary Science XXVIII (1997) 1073
5. Honda, R., Azuma, R.: Crater Extraction and Classification System for Lunar Images. Mem. Fac. Sci. Kochi Univ., 21 (2000) 13-22
6. Leroy, B., Medioni, G., Johnson, E., Matthies, L.: Crater Detection for Autonomous Landing on Asteroids. Image and Vision Computing, 19 (2001) 787-792
7. Costantini, M., Zavagli, M., Di Martino, M., Marchetti, P., Di Stadio, F.: Crater Recognition. Proc. IGARSS'2002 - International Geoscience & Remote Sensing Symposium (2002)
8. Michael, G.: Coordinate Registration by Automated Crater Recognition. Planetary and Space Science, 51 (2003) 563-568
9. Flores-Méndez, A.: Crater Marking and Classification Using Computer Vision. In: Sanfeliu, A., Ruiz-Shulcloper (eds.): CIARP 2003. Springer, Berlin (2003) 79-86
10. Kim, J., Muller, J-P.: Impact Crater Detection on Optical Images and DEMs. Advances in Planetary Mapping (2003)
11. Vinogradova, T., Burl, M., Mjolness, E.: Training of a Crater Detection Algorithm for Mars Crater Imagery. Proc. IEEE Aerospace Conference, Vol. 7 (2002) 3201-3211
12. Alves, E. I.: A New Crater Recognition Method and its Application to Images of Mars. Geophys. Res. Abs., 5 (2003) 08974
13. Brumby, S., Plesko, C., Asphaug, E.: Evolving Automated Feature Extraction Algorithms for Planetary Science. Advances in Planetary Mapping (2003)
14. Plesko, C., Brumby, S., Asphaug, E., Chamberlain, D., Engel, T.: Automatic Crater Counts on Mars. Lunar and Planetary Science XXXV (2004) 1935
15. Magee, M., Chapman, C., Dellenback, S., Enke, B., Merline, W., Rigney, M.: Automated Identification of Martian Craters Using Image Processing. Lunar and Planetary Science XXXIV (2003) 1756
16. Dekker, R.: Texture Analysis and Classification of ERS SAR Images for Map Updating of Urban Areas in the Netherlands. IEEE Transactions on Geoscience and Remote Sensing, 41(9) (2003) 1950-1958
17. Clausi, D., Zhao, Y.: Rapid Extraction of Image Textures by Co-occurrence Using a Hybrid Data Structure. Computers & Geosciences, 28 (2002) 763-774
18. Kayitakire, F., Giot, P., Defourny, P.: Discrimination Automatique de Peuplements Forestiers à partir d'Orthophotos Numériques Couleur: Un Cas d'Étude en Belgique. Canadian Journal of Remote Sensing, 28 (2002) 629-640
19. Soille, P.: Morphological Image Analysis, 2nd Edition. Springer, Berlin (2003)
Classification of Dune Vegetation from Remotely Sensed Hyperspectral Images
Steve De Backer1, Pieter Kempeneers2, Walter Debruyn2, and Paul Scheunders1
1 University of Antwerp, Groenenborgerlaan 171, 2020 Antwerpen, Belgium
{steve.debacker, paul.scheunders}@ua.ac.be
2 Flemish Institute for Technological Research, Boerentang 200, 2400 Mol, Belgium
{pieter.kempeneers, walter.debruyn}@vito.be
Abstract. Vegetation along coastlines is important to survey because of its biological value with respect to the conservation of nature, but also for security reasons since it forms a natural seawall. This paper studies the potential of airborne hyperspectral images to serve both objectives, applied to the Belgian coastline. Here, the aim is to build vegetation maps using automatic classification. A linear multiclass classifier is applied using the reflectance spectral bands as features. This classifier generates posterior class probabilities. Generally, in classification the class with maximum posterior value would be assigned to the pixel. In this paper, a new procedure is proposed for spatial classification smoothing. This procedure takes into account spatial information by letting the decision depend on the posterior probabilities of the neighboring pixels. This is shown to render smoother classification images and to decrease the classification error.
1 Introduction The goal of remote sensing is to acquire information about the substances present in a targeted area. This information is derived solely from the reflectance measured in the visual and infrared domains of the electromagnetic spectrum. Traditionally, multispectral remote sensors acquired only a few wavelength bands, and the study of vegetation was limited to vegetation indices, defined as specific ratios of bands. In recent years, hyperspectral sensors became available, allowing the spectrum to be sampled at a wavelength resolution of a few nanometers. This type of data has been used for different types of vegetation monitoring, such as weed detection [1] and the investigation of vegetation on saltmarshes [2]. In this paper, we investigate the use of hyperspectral images for vegetation monitoring in the coastal area. We consider the differentiation of multiple vegetation species from a hyperspectral remotely sensed image. The goal of the classification is to build a low-cost detailed vegetation map of the dynamic dune area at the Belgian coast. Minimal field work, defining regions containing the different species, is used to build a
classifier that generates the detailed vegetation map from a hyperspectral image. These vegetation maps are used to judge the safety of the seawall from its vegetation and for dune environmental management. Before classification a vegetation mask is built, masking out all non-vegetation pixels. On the remaining pixels a multiclass classifier is applied using the reflectance spectra. The classifier outputs posterior class probabilities. In classification, the class with maximum posterior value is assigned to the pixel. The obtained vegetation maps, however, are usually hard to interpret, because they appear very noisy, containing a lot of single-pixel classes. For the end users, smooth vegetation maps with little class variation are required. Over the years various methods solving the spatial class variation problem have been proposed; in [3,4] a comparison of different approaches to contextual classification is made. In this paper, we propose a new simple procedure for spatial classification smoothing, requiring little extra computational effort. The technique is a Bayesian one, adapting the obtained posterior class probabilities using a constructed prior probability that contains spatial information. The spatial information takes into account the posterior probability values of the neighboring pixels. In this way, a smoothing of the vegetation map is performed while maintaining minimal classification error. In the next section we introduce the applied multiclass classifier as a combination of binary classifiers. We also show how to obtain class posterior probability values from this combination of binary classifiers. In Sect. 3 we introduce the proposed spatial classification smoothing technique. In Sect. 4 the technique is applied to hyperspectral imagery of coast vegetation.
2 Multiclass Classification
Multiclass versions of most classification algorithms exist, but they tend to be complex. A common approach to construct a multiclass classifier is by combining the output of several binary ones.
2.1 Binary Linear Classifier
For the binary classifier, we adopted a simple linear discriminant classifier (LDA) [5]. Assuming equal covariance matrices for both classes, this classifier finds the optimal linear boundary. A projection weight vector w and a threshold w_0 are the parameters to estimate in the two-class problem, and are calculated by:
w = \hat{\Sigma}^{-1}(m_1 - m_2), \qquad w_0 = -\tfrac{1}{2}\, w^T (m_1 + m_2) \qquad (1)
where m_1 and m_2 are the means of each class, and \hat{\Sigma} is the estimated pooled class covariance matrix (we assume equal prior probability for both classes). Test samples x are then classified by the simple rule:
assign x to class 1 if w^T x + w_0 > 0, and to class 2 otherwise \qquad (2)
This method is very fast to train and to apply. In case the training set is not sufficiently large, \hat{\Sigma} can become singular; in these cases a pseudo-inverse approach can be used to find w and w_0 [6]. In this work, we are not only interested in the assigned class, but in the posterior probabilities for both classes, which are estimated by:
p(c_1 \mid x) = \frac{1}{1 + \exp\big(-(w^T x + w_0)\big)}, \qquad p(c_2 \mid x) = 1 - p(c_1 \mid x) \qquad (3)
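A compact sketch of Eqs. (1)-(3) in Python/NumPy follows; the logistic form of the posterior is our reading of Eq. (3), and the pooled-covariance estimate is one reasonable choice.

```python
import numpy as np

def fit_lda(x1, x2):
    """Two-class LDA with equal priors and a pooled covariance matrix (Eq. (1)).

    x1, x2: (n1, d) and (n2, d) arrays of training samples of the two classes.
    """
    m1, m2 = x1.mean(axis=0), x2.mean(axis=0)
    pooled = np.cov(np.vstack([x1 - m1, x2 - m2]).T)
    w = np.linalg.pinv(pooled) @ (m1 - m2)     # pseudo-inverse handles singular covariance [6]
    w0 = -0.5 * w @ (m1 + m2)
    return w, w0

def posterior_class1(x, w, w0):
    """Posterior probability of class 1 for sample x (logistic form of Eq. (3))."""
    return 1.0 / (1.0 + np.exp(-(w @ x + w0)))
```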
2.2 Multiclass Classifier
Several methods are available for combining multiple binary classifiers into a multiclass classifier. Mostly, one-against-all or one-against-one [7,8] approaches are used. With the one-against-all strategy, each classifier is trained to differentiate one class from all the others, which requires a number of classifiers equal to the number of classes K. In the one-against-one approach, all possible pairs of classes are compared, requiring K(K-1)/2 classifiers. Different methods defining other codings of the classes were also suggested [9,10]. Here, we will apply the one-against-one scheme.
2.3 Multiclass Posterior Probabilities
For each of the binary classifiers in the one-against-one combination, the posterior probabilities are obtained. We follow [11] to obtain a combined posterior probability for the multiclass case. Define the pairwise probability estimates r_{ij} as the output of the binary classifier which compares class i against class j, calculated as in (3). For the K-class case we have to look for K class probabilities p_i which satisfy
\mu_{ij} = \frac{p_i}{p_i + p_j} \approx r_{ij}, \qquad \sum_{i=1}^{K} p_i = 1 \qquad (4)
obtaining K - 1 free parameters and K(K-1)/2 constraints, so it is generally impossible to find p_i that meet all equations. In [11], the authors opt to find the best approximation by minimizing the Kullback-Leibler distance between r_{ij} and \mu_{ij}:
l(p) = \sum_{i<j} n_{ij} \left[ r_{ij} \log\frac{r_{ij}}{\mu_{ij}} + (1 - r_{ij}) \log\frac{1 - r_{ij}}{1 - \mu_{ij}} \right] \qquad (5)
where n_{ij} is the number of training points used to estimate the binary classifier that predicts r_{ij}. They also suggest an iterative scheme to minimize this distance:
start with an initial guess for the p_i and calculate the corresponding \mu_{ij};
repeat until convergence: loop over i = 1, ..., K, updating
p_i \leftarrow p_i \cdot \frac{\sum_{j \neq i} n_{ij}\, r_{ij}}{\sum_{j \neq i} n_{ij}\, \mu_{ij}} \qquad (6)
then normalize the p_i so that they sum to one, and recalculate the \mu_{ij}.
For this algorithm Hastie and Tibshirani proved that the Kullback-Leibler distance between r_{ij} and \mu_{ij} decreases at each step, and since the distance is bounded below by zero, the procedure converges. This procedure is repeated for all points in the test set.
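The coupling procedure of Eqs. (4)-(6) can be sketched as follows (illustrative code; the stopping criterion and iteration limit are our own choices).

```python
import numpy as np

def pairwise_coupling(r, n, iters=100, tol=1e-6):
    """Couple pairwise probabilities into class posteriors (Hastie-Tibshirani [11]).

    r[i, j]: estimate of p(class i | class i or j) from the (i, j) binary classifier.
    n[i, j]: number of training points used for that classifier.
    Returns the coupled posterior vector p of length K.
    """
    K = r.shape[0]
    p = np.full(K, 1.0 / K)                              # initial guess
    for _ in range(iters):
        p_old = p.copy()
        for i in range(K):
            mu = p[i] / (p[i] + p)                       # current mu_ij for all j
            num = sum(n[i, j] * r[i, j] for j in range(K) if j != i)
            den = sum(n[i, j] * mu[j] for j in range(K) if j != i)
            p[i] *= num / den                            # update of Eq. (6)
            p /= p.sum()                                 # renormalise
        if np.abs(p - p_old).max() < tol:
            break
    return p
```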
3 Spatial Classification Smoothing
For each pixel in the image we can calculate the posterior distributions using the technique of Sect. 2.3. Call p_k(i,j) the posterior probability for class k calculated at location (i,j) in the image. Normally, to assign a label to the pixel, the label of the class with maximum posterior probability is taken. Define \hat{c}(i,j) as the class with the maximum posterior probability at location (i,j):
\hat{c}(i,j) = \arg\max_k \; p_k(i,j) \qquad (7)
This is not necessarily the optimal way to go when using images. In fact, no knowledge about the spatial relation between the pixels is exploited. One can assume neighboring pixels to have similar posterior distributions. This information can be used as prior knowledge for defining a new prior distribution for a pixel, based on the posterior probabilities from classification in the neighborhood of the pixel. Define this new prior probability of a pixel as the average over the posterior probabilities of the neighboring pixels:
P_k(i,j) = \frac{1}{N} \sum_{(i',j') \in \mathcal{N}(i,j)} p_k(i',j') \qquad (8)
where N is the number of points in the neighborhood taken into account. When looking at p_k as an image, the new prior is in fact a smoothed version of this image. A new posterior probability is obtained by using Bayes' rule:
p_k^{new}(i,j) = \frac{P_k(i,j)\, p_k(i,j)}{\sum_l P_l(i,j)\, p_l(i,j)} \qquad (9)
Classifying using these new posterior probabilities will result in smoother image maps containing fewer single-pixel classes.
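The complete smoothing step of Eqs. (7)-(9) amounts to a few lines of NumPy/SciPy; the following sketch assumes the per-class posterior images are stacked into a single array, and the radius value is illustrative.

```python
import numpy as np
from scipy import ndimage

def smooth_classification(posteriors, radius=3):
    """Spatially smoothed maximum-posterior classification.

    posteriors: array of shape (K, H, W) with per-class posterior images.
    The prior of a pixel is the average posterior over a circular
    neighbourhood (Eq. (8)); the new posterior is prior * posterior,
    renormalised over the classes (Eq. (9)).
    """
    y, x = np.ogrid[-radius:radius + 1, -radius:radius + 1]
    disk = (x ** 2 + y ** 2) <= radius ** 2
    kernel = disk / disk.sum()
    prior = np.stack([ndimage.convolve(p, kernel, mode='nearest')
                      for p in posteriors])
    new_post = prior * posteriors
    new_post /= new_post.sum(axis=0, keepdims=True)
    return np.argmax(new_post, axis=0)                  # smoothed class map, Eq. (7)
```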
Fig. 1. Image of dune area, with extracted area for which the calculated vegetation maps will be shown
4 Experiments and Discussion
A test area at the west coast of Belgium has been selected, for which hyperspectral image data were obtained using the Compact Airborne Spectrographic Imager (CASI-2) sensor. The data were acquired in 48 spectral bands between 400 and 950 nm. The image data were corrected for atmospheric absorption and geometric distortion. Around the same time, a ground survey was carried out to record the locations of the different occurring plant species. During this field work 19 different plant species contained in 148 regions were identified. Using a differential GPS in the field, these regions, together with their plant labels, were associated with pixels in the hyperspectral image data. Figure 1 shows the complete image, including the ground truth regions and a selected subimage for which the calculated vegetation maps will be shown. A threshold on the normalized difference vegetation index (NDVI), which is a ratio between the green and the infrared band, is used to mask out any non-vegetation spectra. The training data set consists of 2159 samples distributed over the 19 classes. The different classes are unbalanced, since their sizes vary strongly, ranging from 4 points to 703 points. This difference in size was assumed coincidental, and not to reflect the prior distribution of the different plant species. Therefore, we assumed equal prior probability for each species while building the classifier and estimating the multiclass posterior probabilities. As described in Sect. 2, the training set was used to estimate the classifier parameters. The data set contains K = 19 different classes. In the one-against-one
Fig. 2. Part of the image with pixel class labels color coded. Left the normal maximum posterior classification, and right including the neighborhood prior
multiclass approach this results in K(K-1)/2 = 171 binary classifiers, and thus 171 different pairwise estimates r_{ij} are obtained. These values are then used to find the p_k by applying definition (4), which in turn are used to find the posterior probabilities for each pixel in the image. Generally these are used to estimate the class of each pixel using the maximum posterior (7). Here, we will use the proposed spatial classification smoothing by defining a new prior, taking into account a circular region of fixed radius.
Applying Bayes' rule (9), we obtain new posterior values that are used for maximum posterior classification. In Fig. 2, we show part of the color-coded classification result. The left image shows the result of the standard maximum posterior classification. The right image shows the result after including the extra prior probability step. One can immediately see that many single-pixel classes and other small structures have vanished. This smoothing property is of importance when interpreting the classification image, when the user is not interested in finely detailed class information. Furthermore, the expected classification error using the standard procedure was found to be 11%, and decreased to 9% using the suggested procedure.
SAR Image Classification Based on Immune Clonal Feature Selection
Xiangrong Zhang, Tan Shan, and Licheng Jiao
National Key Lab for Radar Signal Processing, Institute of Intelligent Information Processing, Xidian University, 710071 Xi'an, China
[email protected]
Abstract. Texture provides valuable information for synthetic aperture radar (SAR) image classification, especially when single-band and single-polarized SAR is concerned. Three texture feature extraction methods, the gray-level co-occurrence matrix, the gray-gradient co-occurrence matrix and the energy measures of the undecimated wavelet decomposition, are introduced to represent the textural information of a SAR image. However, the simple combination of these features with each other is usually not suitable for SAR image classification because of the resulting redundancy and the added computational complexity. Based on the immune clonal selection algorithm, a new feature selection approach characterized by rapid convergence to a global optimal solution is proposed and applied to find the optimal feature subset. Based on the selected features, SVMs are used to classify the land covers in SAR images. The effectiveness of the selected feature subset and the validity of the proposed method are verified by the experimental results.
1 Introduction Texture is an important characteristic used for identifying objects or regions of interest in an image [1]. Especially for single-band and single-polarized SAR image classification, texture may provide abundant useful information. It contains significant information about the structural arrangement of surfaces and their relationship to the surrounding environment. Texture can be considered an innate property of all surfaces. There exist various statistical methods for extracting textural information, for example, the histogram statistics method, the autocorrelation function, the energy spectrum and the correlation frequency method. More recently, texture features based on the gray-level co-occurrence matrix [1], [2] and methods of multi-channel or multi-resolution analysis have received much attention [3]. Obviously, combining the different texture features above is helpful to improve the classification accuracy. However, the resulting redundancy and the added computation time usually degrade the performance of classifiers. Accordingly, it is necessary to find the most suitable feature subset for SAR image classification. Feature selection is usually considered as an optimization problem, and evolutionary algorithms can be used. Among them, the genetic algorithm (GA) is a global searching algorithm and is widely used in feature selection [4].
Unfortunately, GA has the unavoidable disadvantages that its speed of convergence is low and the optimal solution cannot be obtained in a limited number of generations, because it emphasizes competition alone and ignores communication between individuals. Unlike the traditional GA, the immune clonal selection algorithm (ICSA) overcomes these shortcomings of GA to some degree [5]. It is based on the artificial immune system, and competition and cooperation coexist in it. ICSA demonstrates a self-adjusting capability by accelerating or restraining the generation of antibodies, which enhances the diversity of the population, whereas GA selects the individuals of the parent generation on the basis of fitness only. Accordingly, a new feature selection method based on ICSA, immune clonal feature selection, is proposed and applied to search for the optimal texture feature subset. We are ultimately interested in the classification of SAR images. Recently, plentiful applications have shown that SVM has desirable advantages over traditional pattern recognition methods [6]. Accordingly, SVM is applied to SAR image classification on the texture features selected by ICSA. This paper is organized as follows. In section 2, there is a brief review of three methods for extracting texture features. Then, the feature selection based on ICSA is presented in section 3. In section 4, SVM is applied to the classification and the proposed method is tested by two experiments of SAR image classification. Finally, conclusions are given in section 5.
2 Texture Feature Extraction In this paper, we investigate the performance of texture features derived from the gray-level co-occurrence matrix, from the gray-gradient co-occurrence matrix, and from the energy measures of the undecimated wavelet decomposition [7] for land-cover classification of SAR images. Gray-level co-occurrence matrix based texture features have achieved considerable success in texture classification, because they effectively characterize the gray-tone spatial dependencies. From the gray-level co-occurrence matrix, we can extract 14 textural features [1]: angular second moment, contrast, correlation, sum of squares: variance, inverse difference moment, sum average, sum variance, sum entropy, entropy, difference variance, difference entropy, information measure of correlation I, information measure of correlation II and maximal correlation coefficient. However, the gray-level co-occurrence matrix based method cannot provide edge information, which is usually an essential feature for object identification tasks. A solution is to combine the features extracted from the gray-level co-occurrence matrix with those extracted from the gray-gradient co-occurrence matrix, which concerns the joint statistical distribution of the gray level and the edge gradient. From the gray-gradient co-occurrence matrix, the following 15 features can be computed: little gradient dominance, large gradient dominance, gray heterogeneity, gradient heterogeneity, energy, gray average, gradient average, gray mean square error, gradient mean square error, correlation, gray entropy, gradient entropy, hybrid entropy, inertia and inverse difference moment. Furthermore, the undecimated wavelet decomposition provides robust texture features at the expense of redundancy.
The energies of the channels of the wavelet decomposition, usually denoted by the 1-norm in the literature, were found to be highly effective as features for texture analysis. In this paper, the texture feature set is made up of the energies of the channels of the undecimated wavelet decomposition.
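To make the feature-extraction stage concrete, here is a small NumPy-only sketch (not the authors' code) of a gray-level co-occurrence matrix and a few of the Haralick-style statistics listed above; the quantization to 16 levels, the single pixel offset and the window handling are illustrative assumptions.

```python
import numpy as np

def glcm(window, levels=16, dx=1, dy=0):
    # Quantize the 8-bit window and count co-occurring gray-level pairs at the
    # given offset; symmetrize and normalize to a probability matrix.
    q = (window.astype(np.float64) / 256.0 * levels).astype(int).clip(0, levels - 1)
    mat = np.zeros((levels, levels), dtype=np.float64)
    h, w = q.shape
    for y in range(h - dy):
        for x in range(w - dx):
            mat[q[y, x], q[y + dy, x + dx]] += 1.0
    mat += mat.T
    return mat / mat.sum()

def glcm_features(p):
    i, j = np.indices(p.shape)
    asm = np.sum(p ** 2)                              # angular second moment
    contrast = np.sum(p * (i - j) ** 2)
    entropy = -np.sum(p[p > 0] * np.log(p[p > 0]))
    idm = np.sum(p / (1.0 + (i - j) ** 2))            # inverse difference moment
    return np.array([asm, contrast, entropy, idm])
```

In practice such features would be computed per window and over several offsets, then concatenated with the gray-gradient and wavelet-energy features before selection.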
3 Immune Clonal Feature Selection ICSA is a good choice for solving an optimization problem [5]. Derived from traditional evolutionary algorithms, ICSA introduces the mechanisms of affinity maturation, cloning and memorization. Rapid convergence and good global search capability characterize the performance of the corresponding operators. The rapid convergence of ICSA to a global optimum is exploited to speed up the search for the most suitable feature subset among a huge number of possible feature combinations.
3.1 A Brief Review of ICSA The clonal selection theory (Burnet, F.M., 1959) describes the basic features of an immune response to an antigenic stimulus; it establishes the idea that cells are selected when they recognize antigens, and then proliferate. When exposed to antigens, immune cells that can recognize and eliminate them are selected in the body and mount an effective response during the course of clonal selection. The clonal operator is a random map on antibodies, induced by the affinity, comprising three steps: clone, clonal mutation and clonal selection. The state transfer of the antibody population can thus be written as A(k) -> (clone) -> A'(k) -> (clonal mutation) -> A''(k) -> (clonal selection) -> A(k+1).
According to the affinity function f, a point in the solution space is expanded by the clonal operator into a number of identical copies. A new antibody population is produced after the clonal mutation and clonal selection are performed. Here, an antibody corresponds to an individual in an evolutionary algorithm.
3.2 Feature Selection Based on ICSA Feature selection based on ICSA can be stated as identifying the d most discriminative measurements out of a set of D potentially useful measurements, where d < D; namely, finding the antibody whose affinity is maximal. The affinity corresponds to the value of an evaluation function measuring the efficiency of the selected feature subset. For pattern classification, the Bayes classification error minimization criterion is the ideal evaluation function, but it is difficult to evaluate and analyze because one cannot easily obtain the probability distribution of the data in practice. An alternative is to use the classification accuracy as the evaluation function, which is called the
wrapper selection method [8]. In addition, because the goal of feature selection is to achieve the same or better performance using fewer features, the evaluation function should also take the number of features into account [9]. However, using the accuracy as part of the evaluation function is time-consuming, because the learning algorithm must be run on every feature subset visited during the course of feature selection; its generalization is also poor, because the feature selection must be performed again when the learning algorithm changes, although it can achieve higher accuracy. We therefore apply a filter selection method: the feature selection is treated as an independent procedure, and a distance measure is used as the separability criterion. It is well known that the Bhattacharyya distance determines an upper bound on the Bayes classification error, so it is appropriate to use it as the separability measure for normally distributed classes. Combining the distance measure and the dimension of the selected feature subset, the affinity function is given as:
where Aff is the affinity of an individual, d the dimension of the selected feature subset, and J the distance separability criterion. For two classes indexed by i and j, the Bhattacharyya distance is defined by

B_ij = (1/8) (mu_i - mu_j)^T [ (Sigma_i + Sigma_j)/2 ]^{-1} (mu_i - mu_j) + (1/2) ln( |(Sigma_i + Sigma_j)/2| / sqrt(|Sigma_i| |Sigma_j|) ),

where mu_i and mu_j are the feature mean vectors and Sigma_i and Sigma_j denote the class covariance matrices for classes i and j respectively. Class mean vectors and covariances are estimated from the available training data. For multi-class problems, the average Jeffreys-Matusita distance (JMD) is used as the separability criterion. For C classes, the average JMD is defined as the mean over all class pairs of JM_ij = sqrt( 2 (1 - exp(-B_ij)) ).
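A small NumPy sketch of this separability measure is given below; it computes the standard Bhattacharyya distance and the average Jeffreys-Matusita distance over class pairs. The paper's exact affinity formula (2), which also folds in the subset dimension, is not reproduced, so treat this as an assumed illustration.

```python
import numpy as np
from itertools import combinations

def bhattacharyya(mu_i, cov_i, mu_j, cov_j):
    cov = 0.5 * (cov_i + cov_j)
    diff = mu_i - mu_j
    term1 = 0.125 * diff @ np.linalg.solve(cov, diff)
    term2 = 0.5 * np.log(np.linalg.det(cov) /
                         np.sqrt(np.linalg.det(cov_i) * np.linalg.det(cov_j)))
    return term1 + term2

def average_jmd(means, covs):
    """means: list of class mean vectors; covs: list of class covariances."""
    pairs = combinations(range(len(means)), 2)
    jm = [np.sqrt(2.0 * (1.0 - np.exp(-bhattacharyya(means[i], covs[i],
                                                     means[j], covs[j]))))
          for i, j in pairs]
    return float(np.mean(jm))
```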
Formula (2) ensures that the higher the distance measure, the higher the affinity; and in the case that two subsets achieve the same performance, the subset with the lower dimension is preferred. Between the distance measure and the dimension of the selected feature subset, the former is the major concern. Feature selection is then reduced to finding the antibody that maximizes the affinity. In conclusion, immune clonal feature selection can be summarized as follows. Step 1: Initialize Population. Let k = 0; a binary encoding scheme is used to represent the presence or absence of a particular feature of the training samples. An antibody is a binary string whose length D is determined by the total number of features extracted. Each locus of the string is set to 1 when the associated feature is present and to 0 when it is absent. The initial antibody population A(0) is generated randomly, and each antibody in the population represents a different feature subset.
Step 2: Evaluate the Affinity. Each antibody is decoded to the corresponding feature combination and the new training sample sets are obtained. The affinity {Aff(A(0))} is evaluated with (2). Step 3: Determine the Termination Condition. The termination condition can be a threshold on the affinity or the maximal number of evolution generations. If it holds, the iteration stops and the optimal antibody in the current population is the final solution; otherwise, the iteration continues. Step 4: Clone. The clonal operator is applied to the current parent population A(k), yielding A'(k). The clonal size of each individual can be determined proportionally to the affinity between antibody and antigen, or set to a constant integer for convenience. Step 5: Clonal Mutation. The clonal mutation operator is applied to A'(k) with a given mutation probability, yielding A''(k). Step 6: Evaluate the Affinity. The corresponding feature subset is obtained from each antibody of the current population A''(k) and the new training data sets are formed. The affinity {Aff(A''(k))} is evaluated with (2). Step 7: Clonal Selection. In each subpopulation, if a mutated antibody with higher affinity exists, it replaces the original antibody and is added to the new parent population; that is, antibodies are selected proportionally to their affinity to form the population of the next generation A(k + 1). Step 8: Evaluate the Affinity. Based on each antibody of the current population, the corresponding feature subset is obtained and the new training sample sets are formed. The affinity {Aff(A(k + 1))} is evaluated with (2). Step 9: k = k + 1; return to Step 3.
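As a compact illustration of Steps 1-9, the following Python sketch runs a clone / mutate / select loop over binary feature masks. The population size, clone count, mutation rate and the stand-in affinity (a separability score minus a small dimensionality penalty) are illustrative assumptions rather than the paper's exact settings.

```python
import numpy as np

rng = np.random.default_rng(0)

def affinity(antibody, separability, penalty=1e-3):
    d = antibody.sum()
    if d == 0:
        return -np.inf
    return separability(antibody) - penalty * d   # stand-in for formula (2)

def icsa_select(D, separability, pop=10, clones=5, p_mut=0.05, generations=50):
    A = rng.integers(0, 2, size=(pop, D))                      # Step 1
    for _ in range(generations):                               # Step 3
        new_pop = []
        for a in A:                                            # Steps 4-7
            copies = np.repeat(a[None, :], clones, axis=0)
            flips = rng.random(copies.shape) < p_mut
            copies = np.where(flips, 1 - copies, copies)       # clonal mutation
            cand = np.vstack([a[None, :], copies])
            f = np.array([affinity(c, separability) for c in cand])
            new_pop.append(cand[np.argmax(f)])                 # clonal selection
        A = np.array(new_pop)
    fitness = np.array([affinity(a, separability) for a in A])
    return A[np.argmax(fitness)]                               # best feature mask
```

Here separability is any callable scoring a binary mask, for example the average JMD sketched earlier applied to the selected feature columns.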
4 Application of SVM to SAR Image Classification SVM, one of the kernel methods, is a relatively recent learning algorithm proposed by Vapnik, in which the data can be non-linearly mapped to a higher-dimensional feature space by replacing the dot product operation in the input space with a kernel function K(.,.). The method finds the decision hyperplane that separates the positive and negative examples with maximum margin. By defining the hyperplane in this way, SVM generalizes effectively to unknown instances, which has been demonstrated by various applications. Therefore, SVM is used to classify the land covers in SAR images based on the selected feature vectors.
Two experiments have been carried out to test the efficiency of the proposed method. In the first experiment, an X-band SAR sub-image of Switzerland is used, in which there are three land covers: lake, urban and mountainous region. General and representative samples are selected first. A feature vector of dimension 29 is extracted for the center pixel of a 5x5 window region from the gray-level co-occurrence matrix and the gray-gradient co-occurrence matrix. In ICSA, the antibody population size is 10 and the length of each antibody is 29. The number of evolutionary generations is 50. The clonal size is 5. The RBF kernel is used in the SVMs. The classification results are shown in Fig. 1.
Fig. 1. X-SAR image classification (a) original image, (b) classification using the features from the gray-level co-occurrence matrix, (c) classification using the features from the gray-gradient co-occurrence matrices, (d) classification by combining these two kinds of features, and (e) classification using the proposed method
From Fig. 1, it is obvious that the proposed method, with 9 features selected, outperforms the methods based on features extracted from the gray-level co-occurrence matrices and from the gray-gradient co-occurrence matrices respectively, and also outperforms the method based on the simple combination of these two kinds of features with 29 dimensions. A further experiment is carried out on a Ku-band SAR image with 1-m resolution, from the area of the Rio Grande River near Albuquerque, New Mexico. It is classified into three different land-cover classes. Unlike experiment 1, the 24-dimensional texture features are extracted from gray-level co-occurrence matrices with a 5x5 window and from the channel energies of the undecimated wavelet decomposition with an 8x8 window. The wavelet decomposition level is 3, and the resulting wavelet feature set is of dimension 10. In ICSA, the antibody population size is 10 and the length of each antibody is 29. The number of evolutionary generations is 50. The clonal size is 5.
The input texture feature vectors are normalized to values between 0.0 and 1.0. The RBF kernel is used in the SVMs. Classification results are shown in Fig. 2.
Fig. 2. Ku-band SAR image classification (a) original image, (b) classification using the features from gray-level co-occurrence matrix, (c) classification using the energy measures of the undecimated wavelet decomposition, (d) classification by combining these two kinds of features, and (e) classification using the proposed method
From Fig. 2, we can see that the result based on the undecimated wavelet decomposition is more consistent than the other results. However, it loses many more details, such as the pipeline on the river. The classification result based on the combination of the two kinds of features and that of the proposed method both preserve the details, but the latter uses fewer features than the former: the feature dimensions used by the two methods are 24 and 4 respectively. Furthermore, we have compared GA-based feature selection with immune clonal feature selection on the second experiment. For both algorithms, the termination condition is the number of generations, 50. The two methods took approximately the same time, about 9 seconds, in the feature selection step. However, immune clonal feature selection converges to the stable affinity 1995.72287 at the 20th generation, and the dimension of the selected feature subset is 4, while the GA-based method reaches an affinity of 1994.59083 when the iteration stops, with a selected feature subset of dimension 7. From these results, we conclude that the proposed method is more effective than GA for feature selection.
5 Conclusion A systematic way of texture feature extraction, feature selection and classification of land covers in SAR images is presented in this paper. Three methods have been introduced for the texture feature extraction stage, and immune clonal feature selection is proposed and applied to find the optimal feature subset in the space of all possible feature subsets taken from gray-level co-occurrence matrices, gray-gradient co-occurrence matrices, and the energy measures of the undecimated wavelet decomposition of a local region in an image. The rapid convergence of ICSA ensures that the optimal feature combination can be reached quickly. The classification of land covers is then carried out with SVMs, which are characterized by good generalization. Experimental results show that the selected texture feature subset and the SVMs can be successfully applied to the land-cover classification of single-band and single-polarized SAR images. However, further research on feature extraction is needed, such as the application of multiscale geometric analysis, which includes ridgelets, curvelets, brushlets, etc. These methods may be more effective at characterizing the texture features of an image.
References
1. Haralick, R.M., Shanmugam, K., Dinstein, I.: Textural Features for Image Classification. IEEE Trans. on Systems, Man, and Cybernetics 3 (1973) 610-621
2. Solberg, A.H.S., Jain, A.K.: Texture Fusion and Feature Selection Applied to SAR Imagery. IEEE Trans. on Geoscience and Remote Sensing 35 (1997) 475-479
3. Peleg, S., Naor, J., Hartley, R., Avnir, D.: Multiple Resolution Texture Analysis and Classification. IEEE Trans. on Pattern Analysis and Machine Intelligence 6 (1984) 518-523
4. Yang, J., Honavar, V.: Feature Subset Selection Using a Genetic Algorithm. IEEE Trans. on Intelligent Systems 13 (1998) 44-49
5. Jiao, L.C., Du, H.F.: Development and Prospect of the Artificial Immune System. Acta Electronica Sinica 31 (2003) 73-80
6. Vapnik, V.: The Nature of Statistical Learning Theory. Springer-Verlag, New York (1995)
7. Fukuda, S., Hirosawa, H.: A Wavelet-Based Texture Feature Set Applied to Classification of Multifrequency Polarimetric SAR Images. IEEE Trans. on Geoscience and Remote Sensing 37 (1999) 2282-2286
8. Kohavi, R., John, G.H.: Wrappers for Feature Subset Selection. Artificial Intelligence Journal 97 (1997) 273-324
9. Sun, Z.H., Yuan, X.J., Bebis, G., Louis, S.J.: Neural-Network-based Gender Classification Using Genetic Search for Eigen-Feature Selection. IEEE International Joint Conference on Neural Networks 3 (2002) 2433-2438
Depth Extraction System Using Stereo Pairs
Rizwan Ghaffar¹, Noman Jafri¹, and Shoab Ahmed Khan²
¹ Military College of Signals, National University of Sciences and Technology, Pakistan
[email protected] [email protected]
² Centre for Advanced Research in Engineering, Pakistan
[email protected]
Abstract. Stereo vision refers to the ability to infer information about the 3-D structure of a scene from two or more images taken from different viewpoints. Stereo pairs are an imperative source for depth extraction; they are used for giving vision to automated vehicles and for creating digital elevation models from satellite stereo pairs. Processing a pair involves several steps, and techniques exist for each one, but there is no completely defined approach that encompasses these steps for depth extraction from an uncalibrated stereo pair. This paper describes a system which automatically recovers the depth information from two narrow-baseline frames. Our contributions include the development of the system, the automation of the complete process, and a comparison of the rectification approaches and of the stereo matching techniques for disparity maps. We propose a system which takes an uncalibrated stereo pair as input and gives an estimated relative depth. After testing a number of stereo pairs and comparing them with the ground truth, the experimental results demonstrate that our proposed approach leads to a system encompassing state-of-the-art algorithms which extracts the relative depth information from a stereo pair.
1 Introduction Stereo vision refers to the ability to infer information on the 3-D structure and elevation of a scene from two or more images taken from different viewpoints. A classic stereo pair is a narrow-baseline stereo with the two cameras displaced only slightly from each other, while wide-baseline stereo involves largely displaced cameras, so the images have many occluded regions.
Epipolar Geometry
Fig. 1. Epipolar Geometry
Our vivid 3-D perception of the world is due to the interpretation that the brain gives to the computed difference in retinal position, named disparity, between corresponding items; this forms the foundation of stereo vision. Epipolar geometry
[1], which is the intrinsic projective geometry between two views, reduces the search space for finding the correspondences between the two images. Based on this, a stereo system can compute 3-D information without any prior knowledge of the stereo internal and external parameters (uncalibrated stereo). Figure 1 shows two images of a single scene related by the epipolar geometry. This geometry can be described by a 3x3 singular matrix, called the essential matrix if the images' internal parameters are known, or the fundamental matrix otherwise. Let C and C' be a pair of pinhole cameras in 3D space, and let m and m' be the projections through C and C' of a 3D point M in images I and I' respectively. The fundamental matrix maps points in I to epipolar lines in I', and points in I' to epipolar lines in I; corresponding points must satisfy the epipolar constraint m'^T F m = 0, where F is called the fundamental matrix. For a fundamental matrix F there exists a pair of unique points e and e' such that F e = 0 and F^T e' = 0. The points e and e' are known as the epipoles of image I and image I' respectively. The epipoles have the property that all epipolar lines in I pass through e; similarly, all epipolar lines in I' pass through e'.
Photogrammetry
Photogrammetry is the science, art and technology of obtaining reliable measurements, maps, digital elevation models and other derived products from photographs. The objective is to determine the parallax between corresponding points in the stereo pairs. The term parallax refers to the apparent change in the relative positions of stationary objects caused by a change in viewing position, and it corresponds to the disparity in a stereo pair.
2 System Modules Developing a system that computes relative depth requires as its basis the determination of epipolar geometry of the stereo pair
Fig. 2. System Block Diagram
514
R. Ghaffar, N. Jafri, and S.A. Khan
which subsequently needs 8 matching points in a stereo pair. Finding these points leads to the determination of the fundamental matrix, which then simplifies the correspondence problem. A system approach for this solution is shown in the form of a block diagram in Figure 2.
A Matching
The first step is to find a minimum of 8 matching points between the left and right images. They can be selected manually, but an automated system relies on corner detection, corners being useful features for motion correspondence and stereo matching.
B Recovering Epipolar Geometry
The epipolar constraint, that corresponding points must lie on conjugate epipolar lines, underlines the significance of epipolar geometry. Knowing the 8 matching points, the fundamental matrix is calculated; it can be used for rectification and for many other purposes such as scene modeling and vehicle navigation.
C Image Rectification
Rectification determines a transformation or warping of each image such that pairs of conjugate epipolar lines become collinear and parallel to one of the image axes, usually the horizontal one. The disparities can then be found by searching along the image rows instead of along the skewed scan lines, which is computationally intensive.
D Disparity Map
After image rectification, the disparity map can be found by any matching technique such as correlation, SSD, SAD, graph cuts or dynamic programming. The disparity map gives us the relative depth information, but for 3-D reconstruction, information on the geometry of the stereo system is also needed.
3 System Overview Based on the modules described above, we have proposed a system which takes a stereo pair as its input and gives an estimated depth map as the output. Different techniques have been analysed, and the algorithms that gave comparatively better results have been incorporated.
A Matching Block The need for automation has led to two sub-blocks: corner detection and matching. Finding corners, or corner-like features, in an image is a problem common to many computer vision techniques. Usually it is desired to track a feature that has high spatial frequency content in all directions. A corner detector [2] normally functions by considering a local window in the image and determining the average changes that result from shifting the window in all directions; if these shifts result in a large change, a corner is detected.
Fig. 3. Top Pair. Output of Harris Corner Detector. 222 and 201 corners detected in left and right image respectively. Bottom Pair. 126 corners are matched in both images
G. Scott and H. Longuet-Higgins [3] proposed an algorithm for associating the features of two patterns. This approach copes with translation and scaling deformations and with moderate rotations. A remarkable feature of the algorithm is its straightforward implementation, founded on a well-conditioned eigenvector solution which involves no explicit iterations. A pair of images I and I' contain m corners and n corners respectively, which are to be put in one-to-one correspondence. The algorithm builds a proximity matrix G of the two sets of features, where each element is the Gaussian-weighted distance between two features; the correspondence strength is a correlation-weighted proximity.
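The following short sketch (an illustration under stated assumptions, not the authors' implementation) shows the SVD-based association step in its simplest form: a Gaussian proximity matrix, singular values replaced by ones, and mutually-best entries taken as matches. The value of sigma and the use of plain spatial proximity (rather than the correlation-weighted version mentioned above) are assumptions.

```python
import numpy as np

def svd_match(pts1, pts2, sigma=10.0):
    """pts1: (m, 2), pts2: (n, 2) corner coordinates. Returns index pairs."""
    d2 = ((pts1[:, None, :] - pts2[None, :, :]) ** 2).sum(axis=2)
    G = np.exp(-d2 / (2.0 * sigma ** 2))          # Gaussian proximity matrix
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    P = U @ Vt                                    # singular values replaced by 1
    matches = []
    for i in range(P.shape[0]):
        j = int(np.argmax(P[i]))
        if i == int(np.argmax(P[:, j])):          # mutual best in row and column
            matches.append((i, j))
    return matches
```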
B Compute Fundamental Matrix The essential property of the fundamental matrix is that it conveniently encapsulates the epipolar geometry of the uncalibrated imaging configuration.
1 Eight-Point Algorithm Hartley's normalized 8-point algorithm [8] requires 8 independent correspondences. Each matching pair of points between the two images provides a single linear constraint on F. This allows F to be estimated linearly, up to the usual arbitrary scale factor. A match provides a linear constraint on the coefficients of F as defined in (3).
The fundamental matrix is a 3x3 matrix F = [f_ij]. With two corresponding points m = (u, v, 1)^T and m' = (u', v', 1)^T, the epipolar constraint m'^T F m = 0 gives one linear homogeneous equation in the nine coefficients of F, and for 8 corresponding points we obtain a linear system A f = 0 in the 9-vector f of these coefficients. The solution to this problem is the unit eigenvector corresponding to the smallest eigenvalue of A^T A, so the solution is linear. The singularity (rank-2) constraint is then applied to the resulting matrix.
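A compact NumPy sketch of this normalized eight-point estimate is shown below (an illustration, not the authors' implementation); point arrays of shape (N, 2) in pixel coordinates are assumed.

```python
import numpy as np

def _normalize(pts):
    # Translate to the centroid and scale so the mean distance is sqrt(2).
    c = pts.mean(axis=0)
    s = np.sqrt(2.0) / (np.linalg.norm(pts - c, axis=1).mean() + 1e-12)
    T = np.array([[s, 0, -s * c[0]], [0, s, -s * c[1]], [0, 0, 1.0]])
    ph = np.column_stack([pts, np.ones(len(pts))]) @ T.T
    return ph, T

def eight_point(pts1, pts2):
    x1, T1 = _normalize(pts1)
    x2, T2 = _normalize(pts2)
    # Each correspondence gives one row of the linear system A f = 0.
    A = np.column_stack([x2[:, 0] * x1[:, 0], x2[:, 0] * x1[:, 1], x2[:, 0],
                         x2[:, 1] * x1[:, 0], x2[:, 1] * x1[:, 1], x2[:, 1],
                         x1[:, 0], x1[:, 1], np.ones(len(x1))])
    _, _, Vt = np.linalg.svd(A)
    F = Vt[-1].reshape(3, 3)
    U, S, Vt = np.linalg.svd(F)          # enforce the rank-2 (singularity) constraint
    F = U @ np.diag([S[0], S[1], 0.0]) @ Vt
    F = T2.T @ F @ T1                    # undo the normalization
    return F / F[2, 2]
```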
C Image Rectification Once the epipolar geometry has been determined, the corresponding points between the two images must satisfy the epipolar constraint. In general, epipolar lines are not aligned with the coordinate axes and are not parallel. This parallelism can be enforced by applying 2D projective transforms, or homographies, to each image, which is known as image rectification.
Fig. 4. Zheng's Approach. Top Pair: Left image and its rectified image. Bottom Pair: Right image and its rectified image
1 Zheng's Approach Rectification [7] involves decomposing each homography into a projective and an affine component. The projective component that minimizes a well-defined projective distortion criterion is then found. The affine component of each homography is decomposed into a pair of simpler transforms: one designed to satisfy the constraints for rectification, the other used to further reduce the distortion introduced by the projective component.
2 Pre-warping Image morphing, or metamorphosis, is a popular class of techniques for producing transitions between images. Image warping is in essence a transformation that changes the spatial configuration of an image. Seitz's approach to morphing [6] involves three steps, of which the first, called prewarp, is rectification.
Fig. 5. Aerial Image. Left: Rectified image Centre: Disparity Map Right: Depth Map.
Fig. 6. Relative Depth Map of indoor rectified image.
D Disparity Calculation
After the images have been rectified, the next step is the calculation of the disparity matrix, i.e. the parallax between the corresponding points. The matrix so obtained is called the disparity matrix, and its visualization the disparity map. Various matching techniques exist for disparity calculation, such as normalized cross-correlation (NCC), the sum of squared differences (SSD) and the sum of absolute differences (SAD), but certain global methods such as dynamic programming and graph cuts produce better results. We have used graph cuts, as this method gives the best results by accounting for occluded features, thereby reducing the errors. It determines the disparity surface as the minimum cut of a maximum-flow problem on a graph.
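For reference, here is a minimal NumPy sketch of one of the local techniques listed above (SAD block matching on a rectified pair). The block size and disparity search range are illustrative assumptions, and this is a simple baseline rather than the graph-cut method the system actually adopts.

```python
import numpy as np

def sad_disparity(left, right, max_disp=32, block=7):
    """left, right: rectified grayscale images (H, W) of the same size."""
    h, w = left.shape
    r = block // 2
    disp = np.zeros((h, w), dtype=np.int32)
    for y in range(r, h - r):
        for x in range(r, w - r):
            ref = left[y - r:y + r + 1, x - r:x + r + 1].astype(np.int32)
            best, best_d = None, 0
            for d in range(0, min(max_disp, x - r) + 1):
                cand = right[y - r:y + r + 1, x - d - r:x - d + r + 1].astype(np.int32)
                cost = np.abs(ref - cand).sum()    # sum of absolute differences
                if best is None or cost < best:
                    best, best_d = cost, d
            disp[y, x] = best_d
    return disp
```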
E Photogrammetry
Photogrammetry [9] is defined as the science of making accurate measurements by means of aerial photography. After the parallax between the two images is obtained, the principles of triangulation defined in photogrammetry are used for the extraction of
3D information. This uncalibrated stereo-pair approach leads to 3D data whose accuracy is up to a projective transform of the original data, while if we have the necessary calibration information then the essential-matrix approach is adopted, leading to a solution whose accuracy is up to a scale factor of the original information. These relative displacements form the basis for three-dimensional viewing of overlapping photographs. In addition, they can be measured and used to compute the elevations of terrain points.
4 Important Assumptions No image processing technique has been applied to the stereo images in this system. We have assumed that enough image processing has already been carried out and that the images are already enhanced and free from noise and lens distortion.
5 Experimental Results The images we used as the input to our system are from public domain resources (INRIA, Tsukuba). We show the results for two sequences of real scenes, an aerial stereo pair (Fig. 5) and an indoor rectified pair (Fig. 6). As the aerial images pass through the modules of the system, their epipolar geometry is determined, they are rectified, and finally their relative depth map is obtained. The indoor pair is already rectified, so its relative depth map is shown directly. The final relative depth maps show that our proposed system is effective and stable. We also tested both rectification approaches and compared our results in Table 1. The basis of comparison is the rectification criterion that corresponding points should lie in the same row, and the table shows the row difference between corresponding points of the aerial stereo pair. We see that Zheng's approach leads to more accurate results, so it forms part of our proposed system.
6 Conclusion In this paper, we addressed the problem of how to obtain a reliable depth map from two stereo frames. First, a number of corners are detected in both images. Then, based on SVD decomposition, we find matches between the two corner sets and estimate the epipolar geometry. We efficiently rectify the images based on the recovered epipolar geometry. Finally, we demonstrated that a relative depth map can be obtained by using the graph-cuts approach. We tested a number of stereo image pairs under different camera motions and obtained excellent results in all of these cases.
References
[1] E. Trucco and A. Verri, "Introductory Techniques for 3-D Computer Vision".
[2] C.G. Harris and M.J. Stephens, "A combined corner and edge detector", Proceedings Fourth Alvey Vision Conference, Manchester, pp. 147-151, 1988.
[3] G. Scott and H. Longuet-Higgins, "An algorithm for associating the features of two patterns", In Proc. Royal Society London, 1991.
[4] M. Pilu, "Uncalibrated Stereo Correspondence by Singular Value Decomposition", Computer Vision and Pattern Recognition, June 1997.
[5] M.A. Fischler and R.C. Bolles, "Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography", CACM, 24(6):381-395, June 1981.
[6] S. Seitz, "Image-Based Transformation of Viewpoint and Scene Appearance", PhD thesis, University of Wisconsin, 1997.
[7] Z. Zhang, "Computing Rectifying Homographies for Stereo Vision", Microsoft Research, June 2001.
[8] R.I. Hartley, "In Defense of the Eight-Point Algorithm", IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 19, No. 6, June 1997.
[9] T.M. Lilles and R.W. Kiefer, "Remote Sensing and Image Interpretation", John Wiley & Sons, Inc., 1994.
[10] J. Xiao and M. Shah, "Two-Frame Wide Baseline Matching", Computer Vision Lab, School of Electrical Engineering and Computer Science.
[11] V. Kolmogorov and R. Zabih, "Multi-camera Scene Reconstruction via Graph Cuts", Computer Science Department, Cornell University.
Fast Moving Region Detection Scheme in Ad Hoc Sensor Network
Yazhou Liu, Wen Gao, Hongxun Yao, Shaohui Liu, and Lijun Wang
Computer Science and Technology Department of Harbin Institute of Technology, 150001, Harbin, the People's Republic of China
[email protected] [email protected] {yhx,shaol,ljwang}@vilab.hit.edu.cn
Abstract. In this paper we present a simple yet effective temporal-differencing-based moving region detection scheme which can be used in resource-limited conditions such as an ad hoc sensor network. Our objective is to achieve real-time detection on these low-end image sensor nodes. By double-threshold temporal differencing we can exclude the effect of global motion as well as detect the real motion regions. Then, to achieve fast processing speed and overcome the foreground aperture problem, we scale down the search space to a rather small size, 30x40, and apply our Scalable Teeterboard Template to locate the moving regions' bounding boxes. The resource requirements and time complexity of our scheme are very low, yet experimental results show that our scheme yields a high detection rate. Moreover, our scheme's speed and detection rate are essentially unaffected by the number of objects in the field of view.
1 Introduction Wireless sensor networks have been an active research field because of their wide applicability, such as battlefield surveillance and environmental monitoring. The motivation of our research is to incorporate visual ability into the network, which has often been neglected in previous applications because of the huge processing and communication cost of images. So our goal is to develop an effective way to perform visual-information-based surveillance under these resource-limited conditions. This paper will address the problem of how to find the moving regions effectively with our low-end image sensor nodes.
1.1 Application Background A wireless ad-hoc sensor network (WASN) is a network of sensor nodes with limited on-node sensing, processing, and communication capabilities. These distributed sensor networks are essential for effective surveillance on the battlefield and for environmental monitoring. Extensive research work has been done in this field. Two critical areas
are: (i) efficient networking techniques, and (ii) collaborative signal processing (CSP) to efficiently process the distributed information gathered. In CSP, multi-channel information, such as seismic and acoustic information in [1] and humidity and temperature in [2], has been utilized for target object identification and environmental monitoring. Unfortunately, visual information has often been neglected in most of these applications. One possible explanation is that image information is very large compared with other information for a low-end sensor node. Its transmission and processing cost seems to sharply contradict the "smart dust" spirit of massive deployment of low-cost sensor nodes. But undeniably, only visual information can give the most intuitive impression in some applications. For instance, in remote border surveillance, it is really a tough task for sensors to distinguish between a wild animal and a stowaway only by seismic, acoustic or infrared information, and it is also hard to find a resting vehicle convoy by means other than visual information. Capturing visual information with ad-hoc networks is a challenging task because of their limited processing and transmitting capabilities. Our system follows the finding-to-understanding model [4]. The surveillance task is accomplished by two-layer processing. Firstly, the finding layer, instead of performing semantic understanding, mainly picks out the objects of interest that can be recognized by higher-level processing, and then transmits these hot regions by RF. Our low-end image sensors can accomplish this finding task with a simple yet fast and effective algorithm, the Scalable Teeterboard Template, which is specially designed for limited hardware resources; this paper will focus on this part. Secondly, since most pattern recognition models and algorithms in computer vision are time and resource consuming, we leave the understanding task to be dealt with by a higher layer, a base station or wearable computer, whose processing ability is much higher than that of the sensor nodes. Our finding-understanding structure can be seen in Fig. 1.
Fig. 1. Overview of our finding-understanding structure. Moving regions are found by temporal differencing and Teeterboard template, and then are transmitted to higher level (base station/wearable computers) for further classification.
Our camera nodes can work in two modes, event mode and query mode. In event mode, the camera is triggered by some emergent event, such as the seismic, acoustic or temperature readings of the region exceeding a predefined threshold. In this
case, the camera nodes catch the moving objects in their visual fields and transmit them by multi-hop to the base station for further analysis. In query mode, a camera node is triggered by query commands from the top level and captures the whole image for transmission. This mode is for cases in which the target object is motionless and noiseless, such as a resting vehicle convoy. Rather than giving an extensive description of our system, this paper will address the problem of how to find the moving regions effectively with our low-end image sensor nodes. Our detection scheme is based on temporal differencing, and some related work is introduced in the following section.
1.2 Previous Work Video-stream-based target detection and tracking has been a hot research topic for years and many good results have been reported. One of the most commonly used means is temporal differencing. There are schemes, as proposed in [4], [5], for effective motion detection, but they are not suitable for our special application. In [4], a three-level Wallflower scheme has been proposed, which tries to solve many problems that exist in background maintenance, such as light switch, foreground aperture, etc., but it needs a training period to achieve good performance. In W4 [5], three values, the maximum value (M), the minimum value (N) and the largest interframe absolute difference (D), are stored for each pixel; D is used as a threshold to determine whether a given pixel belongs to the foreground or the background. Again, a training period is indispensable and the camera should be stationary to build this background model. But in our application there may not be enough time for training: in event mode, cameras are triggered by emergent events and the target objects may already be in the field of view, so a period of pure stationary background is not available. Also, our camera sensors may not be fastened tightly to the ground, and modest swaying in the wind is possible. So these schemes are not suitable for our application scenario. Pfinder [6] is a real-time system for tracking a person which uses a multi-class statistical model of color and shape to segment a person from a background scene. Like many other systems, it exploits color information for object segmentation. We do not use color information, since our tracked objects have no definite color and grayscale images require less memory. The examples in this paper are color images only for clarity; our real system operates on grayscale images, or more specifically, most of our operations are performed on binary images for memory-saving reasons.
2 Scalable Teeterboard Template Scheme For this special application, our goal is to find moving regions in the field of view and cut them out for further analysis. So the scheme exploited here should be simple, quick, easy to implement in hardware, and memory-saving, without sacrificing much detection accuracy.
2.1 Temporal Differencing A CCD image sensor is a highly noise-prone device. It can be affected by changes of temperature, light and other complicated factors. Even during a short period, two consecutive frames may differ globally (without moving objects), which causes great difficulty for temporal-differencing-based algorithms. So low-pass filters are often needed to denoise the images before differencing. The most commonly used one is the Gaussian filter; a 3x3 Gaussian template is (1/16)[1 2 1; 2 4 2; 1 2 1].
We can see that for each pixel, 10 shifts and 8 additions are needed. For our 320x240 video frames, there are 768000 (10x320x240) shifts and 614400 (8x320x240) additions for each single frame before differencing. These operations are still time-consuming for our low-end sensor nodes. So we omit this filtering preprocessing and let the later erosion operation decrease the noise as well as delete small regions. Our experimental results show that this simplified processing yields foreground regions comparably clear to the ones obtained with the filter.
But there are conditions, such as differences caused by global motion of camera or light switch, in which false alarm may occur. These changes are not what we really interested in. In addition, moving regions detected by temporal differencing are rather big in these cases, which will increase the communication burden and may blocked other useful information. So we introduced another frame level threshold to exclude these false alarm transmissions. And we can get the motion mask image by
Thus far we can get binary motion mask image as shown in Fig. 2 (b). Without filter preprocessing, it is stained by noise seriously. One iteration of erosion and dilation operation is applied to delete noise as well as small regions. To make teeterboard checking faster and deal with the foreground aperture problem [4], we 1/8 down sampling the motion mask image and apply another dilation operation, the 40x30 result
524
Y. Liu et al.
image can be seen in Fig. 2 (d). The motion regions corresponding to the biggest white car are disjoint in Fig. 2 (c) because of its homogeneous color. Fig. 2 (d) shows that our simple scheme can solve this aperture problem successfully and these disjoint parts merge together. Then we applied our scalable teeterboard template to locate each object and find their corresponding bounding box.
Fig. 2. Flow chart of motion region finding. (a) Original input frames of size 320x240; (b) Temporal differencing between two consecutive frames; (c) Erosion and Dilation result; (d) 1/8 down sampling and erosion result; (d) Bounding boxes found by teeterboard template; (f) detecting result
2.2 Scalable Teeterboard Template To find bounding box of each moving object effectively, we introduce a scalable teeterboard template. The initial size of the template is 5x5, as shown in Fig. 3. We raster scan the 40x30 binary mask image, If the newly met black pixel is already within some motion region’s bounding box, we just skip it, and otherwise we take the pixel as the center of teeterboard and begin searching a new motion region. There are two kinds of operation for this template before it can locate the bound of the region, moving and extending. Moving operation is to move the teeterboard’s center to interior or dense part of the black (motion) region. And the moving direction is the heaviest side of the teeterboard. The weight of each direction is defined as:
Where is the distance between the pixel (u,v) and the center (x,y) of the teeterboard in direction i.
Fast Moving Region Detection Scheme in Ad Hoc Sensor Network
525
And each direction’s corresponding region is defined as in Fig. 3. So the moving direction Dir is
Moving step size is a quarter of the template’s width or height, which depends on the moving direction.
and respectively.
are teeterboard template’s length in vertical and horizontal direction
Fig. 3. Initial Teeterboard Template
By moving the teeterboard towards its heaviest side, we can guarantee that this chosen direction points to the interior or dense part of the black (motion) region. But in some cases, the teeterboard is balanced, such as the black region is much bigger than the teeterboard and the teeterboard may stop at the corner of the region, then we apply the extending operation. Extending is to enlarge the teeterboard. One goal of this operation is to break the balance of the teeterboard, so that it can continue its motion towards the gravity center of the motion (black) region. Another goal is to make the teeterboard big enough to cover the whole motion region. So the bounds of the teeterboard are also the bounds of the motion region when the teeterboard stops moving and extending. We chose the teeterboard’s balance direction as the extending direction, which means that if then we extend the teeterboard horizontally. The extending step size we chose here is the same as the moving step size above. If there are no new black pixels added to the teeterboard region after two consecutive horizontal and vertical extensions, which means teeterboard has cover the whole motion region, teeterboard will stop searching. Bounding boxes found by this scheme are shown in Fig. 2 (e) and their corresponding motion regions locating result can be seen in Fig. 2 (f). Above description shows that our fast locating scheme on the one hand benefits from its limited searching space, here is a 40x30 binary image. On the other hand, if a motion region’s bounding box is found, all black pixels in this region will be skipped. So each black pixel will be scan at most once and time complexity of this teeterboard searching scheme is O(n), where n is the number of black pixels in 40x30 down sampling image and
526
Y. Liu et al.
3 Experiments The description above of our algorithm show that it takes rather brute means, erosion and 1/8 down sampling, to denoise, overcome aperture problem as well as get smaller search space. This means we may lose some small moving objects which are deleted as noise. Also speed of the moving object is a key factor which affects the detection accuracy dramatically. We evaluate the performance of our scheme under low moving speed and small object size condition. The sampling rate of our test video stream here is 8 frames per second and frame size is 320x240. We define the detection rate R as
Where
is the total number of frames in which an object occurs and is the number of frames in which the object can be detected by our scheme. The test results on 80 video streams (125 frames per stream) are shown in Table 1. Also our scheme can deal with foreground aperture problem effectively. Disjoint regions of the same object obtain by temporal differencing can be merged together for most of time by 1/8 down sampling and dilation. If not, Scalable Teeterboard Template can achieve this goal further by extending. Some examples of foreground aperture and detection result can be seen in Fig. 4.
4 Conclusion In this paper we proposed a simple yet effective moving object detection scheme which is suitable for ad-hoc sensor networks. To make image processing and communication possible on our low-end sensor nodes, we exploit a double-threshold method to exclude false-alarm transmissions caused by global motion as well as to detect moving regions. Then, by scaling down the search space to a rather small size, here a 40x30 binary mask image, and applying our scalable teeterboard template, we can find the moving regions' bounding boxes effectively and efficiently. Only regions within bounding boxes are transmitted to higher-level nodes for analysis. By these means the communication and processing burden can be decreased greatly. Future work will
focus on the estimation of the moving direction of the same object across frames, so that the communication amount can be decreased further.
Fig. 4. Some detection results for foreground aperture problem
References
1. Brooks, R.R., Ramanathan, P., Sayeed, A.M.: Distributed Target Classification and Tracking in Sensor Networks. Proceedings of the IEEE, 91(8) (Aug. 2003) 1163-1171
2. Mainwaring, A., Polastre, J., Szewczyk, R., Culler, D., Anderson, J.: Wireless Sensor Networks for Habitat Monitoring. ACM International Workshop on Wireless Sensor Networks and Applications, Atlanta, GA, September 28, 2002
3. Lipton, A., Fujiyoshi, H., Patil, R.: Moving Target Classification and Tracking from Real-time Video. Proc. of the 1998 DARPA Image Understanding Workshop (IUW'98), November 1998
4. Toyama, K., Krumm, J., Brumitt, B., Meyers, B.: Wallflower: Principles and Practice of Background Maintenance. In Proc. Int. Conf. Computer Vision, Corfu, Greece (1999) 255-261
5. Haritaoglu, I., Harwood, D., Davis, L.: W4: Who? When? Where? What? A Real-Time System for Detecting and Tracking People. In Proc. of Intl. Conf. on Automatic Face and Gesture Recognition, Nara, Japan (1998) 222-227
6. Wren, C.R., Azarbayejani, A., Darrell, T., Pentland, A.P.: Pfinder: Real-Time Tracking of the Human Body. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(7) (1997) 780-785
LOD Canny Edge Based Boundary Edge Selection for Human Body Tracking
Jihun Park¹, Tae-Yong Kim², and Sunghun Park³
¹ Department of Computer Engineering, Hongik University, Seoul, Korea
[email protected]
² Department of Computer Science and Engineering, Korea University, Seoul, Korea
³ Department of Management Information Systems, Myongji University, Seoul, Korea
[email protected]
[email protected]
Abstract. We propose a simple method for tracking a nonparameterized subject contour in a single video stream with a moving camera and changing background. Our method is based on level-of-detail (LOD) Canny edge maps and graph-based routing operations on the LOD maps. LOD Canny edge maps are generated by changing the scale parameters for a given image. The simplest (strongest) Canny edge map has the smallest number of edge pixels, while the most detailed Canny edge map has the largest number of edge pixels. To reduce side-effects from irrelevant edges, we start our basic tracking using strong Canny edges, called Scanny edges, generated from the large image intensity gradients of an input image. Starting from Scanny edges, we add more edge pixels, moving from the simple Canny edge maps toward the most detailed ones. LOD Canny edge pixels become nodes in routing, and the LOD values of adjacent edge pixels determine routing costs between the nodes. We find a best route that follows Canny edge pixels, favoring stronger ones. Our accurate tracking is based on reducing the effects of irrelevant edges by selecting the stronger edge pixels, thereby relying on the current frame's edge pixels as much as possible, contrary to other approaches that always combine the previous contour. Our experimental results show that this tracking approach is robust enough to handle a complex-textured scene.
1 Introduction and Related Works Tracking moving subjects is a hot issue because of a wide variety of applications in computer vision, video coding, video surveillance, monitoring and augmented reality. This paper addresses the problem of selecting boundary edges for robust contour tracking in a single video stream with a moving camera and changing background. We track a highly textured subject moving in a complex scene, in contrast to the relatively simple subject tracking done by others.
In nonparameterized contour tracking, the subject contour is represented directly as the subject border, i.e., as a set of pixels. Paragios's algorithm [1] and Nguyen's algorithm [2] are popular examples of this approach. Recently, Nguyen proposed a method [2] for tracking a nonparameterized subject contour that combines the outputs of two steps: creating a predicted contour and removing background edges. Because removing background edges is difficult, Nguyen's background edge removal leaves many irrelevant edges, which leads to inaccurate contour tracking in a complex scene. Moreover, Nguyen's method [2] of always combining the predicted contour computed from the previous frame accumulates tracking error. In Nguyen's algorithm [2], a watershed line determined by watershed segmentation [3] together with a watershed line smoothing energy [4,2] becomes the new contour of the tracked subject. Tracking errors are therefore accumulated because the previous contour is always included, regardless of the intensity of the current Canny edges. The predicted contour computed from the previous frame usually differs from the exact contour of the current frame, and a big change between the previous and current contour shapes makes this kind of contour tracking difficult.
2 Our Approach
To overcome Nguyen's two problems, the difficulty of removing noisy background edges and the accumulation of tracking errors, we propose a new method that increases subject tracking accuracy by using LOD Canny edge maps. We compute a predicted contour as Nguyen does, but to reduce side effects caused by irrelevant edges, we start our basic tracking contour from the simple (strong) Canny edges. The strong Canny edge map is generated by a pixel-wise union of the simplest Canny edge maps among the variously scaled Canny edge maps; a Scanny edge map has no noisy background edges and looks simple. Our accurate tracking reduces the effect of irrelevant edges by selecting only the strongest edge pixels, relying on the current frame's edge pixels as much as possible, in contrast to Nguyen's approach of always combining the previous contour. The Canny edge maps generated with progressively smaller image intensity gradient thresholds form the LOD hierarchy, where N is the number of LOD Canny edge maps: the first (strongest) map contains only the Canny edges generated from the largest intensity gradients, while the N-th (most detailed) map accumulates edges from the largest (strongest) down to the smallest (weakest) intensity gradients. Basically, we rely only on Scanny to find reference Scanny pixels, called selected Scanny pixels, for contour routing. These selected Scanny pixels become the start and end nodes in routing. LOD Canny edge pixels become nodes in routing, and the LOD values of adjacent edge pixels determine the routing costs between the nodes; adjacency means four-neighbor connectivity. From a set of adjacent selected Scanny edge pixels, we find segments of contours, called partial contours. In finding a partial contour, we find a best route that follows Canny edge pixels, favoring stronger ones. Our pixel routing favors the strongest
Fig. 1. Overview of Our Single Frame Tracking
Canny edges and then seeks additional edge pixels from the more detailed maps, following the descending order of the LOD edge-map hierarchy. To make a closed contour, we do a final routing using the above partial contours and the Scanny edges around the predicted contour.
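As an illustration of the edge-map hierarchy, the following sketch (not the authors' code; it assumes OpenCV and NumPy, and the threshold sweep and the number of unioned maps are illustrative) generates multi-level Canny edge maps and forms a Scanny map as the pixel-wise union of the simplest ones:

```python
# Hypothetical sketch of LOD Canny edge map generation (not the authors' code).
import cv2
import numpy as np

def lod_canny_maps(gray, n_levels=8):
    """Return Canny edge maps ordered from strongest (fewest edge pixels)
    to most detailed (most edge pixels) by sweeping the hysteresis thresholds."""
    maps = []
    for i in range(n_levels):
        high = 250 - i * 25          # decreasing high threshold -> more edges
        low = high // 2
        maps.append(cv2.Canny(gray, low, high))
    maps.sort(key=lambda m: int(np.count_nonzero(m)))  # fewest edge pixels first
    return maps

def scanny_map(maps, k=2):
    """Pixel-wise union of the k simplest (strongest) edge maps."""
    union = np.zeros_like(maps[0])
    for m in maps[:k]:
        union = cv2.bitwise_or(union, m)
    return union
```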
3 Overview of Our System
Figure 1 shows an overview of our system for tracking a single image frame. As inputs, we take the previous image frame, its tracked subject contour, and the current image frame. From the previous frame, its contour and the current frame, we compute a predicted contour for the current frame using subject motion [2]. We generate various detail levels of Canny edge maps for the current frame and select Scanny edges from these LOD Canny edge maps. From the Scanny edge map, we derive a corresponding distance map. Using the predicted contour, we find the best match between the predicted contour and the Scanny distance map. Scanny edge pixels matching the predicted contour become the frame of the contour build-up; we call them selected Scanny contour pixels. The selected Scanny contour pixels are the reference pixels from which segments of the tracked contour are built, and they are stored in the selected Scanny found list. We route a path connecting adjacent selected Scanny contour pixels in the found list using LOD Canny edge pixels. When every adjacent pair of selected Scanny contour pixels has been connected, we have a set of partial contours, although they are not guaranteed to be complete; complete means that the contour is four-neighbor connected and follows every possible Scanny edge. To build a closed contour for the current frame, we use Scanny edge maps around the predicted contour as well as the set of partial contours. Before this step, we fix a wrongly computed basic contour; one symptom of a wrong basic contour is similar color inside and outside the contour. For this purpose, we erase Canny edge pixels around such predicted contour pixels. The resulting contour becomes the contour of the current frame.
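As a rough illustration of the matching step, the sketch below is hypothetical (not the paper's implementation); it assumes OpenCV/NumPy, and the translation-search criterion and the window size are assumptions. It builds a distance map from the Scanny edge map and scores small shifts of the predicted contour against it:

```python
# Hypothetical sketch: match a predicted contour against the Scanny distance map.
import cv2
import numpy as np

def scanny_distance_map(scanny):
    # cv2.distanceTransform measures distance to the nearest zero pixel,
    # so invert the edge map first (edges become zero-valued).
    return cv2.distanceTransform(255 - scanny, cv2.DIST_L2, 3)

def best_contour_offset(dist_map, contour_pts, search=5):
    """contour_pts: integer (N, 2) array of (x, y) points of the predicted contour.
    Shift the contour inside a small window and keep the offset whose points
    lie closest, on average, to Scanny edges."""
    h, w = dist_map.shape
    best, best_cost = (0, 0), np.inf
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            ys = np.clip(contour_pts[:, 1] + dy, 0, h - 1)
            xs = np.clip(contour_pts[:, 0] + dx, 0, w - 1)
            cost = dist_map[ys, xs].mean()
            if cost < best_cost:
                best, best_cost = (dx, dy), cost
    return best
```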
Fig. 2. Scanny edge map (a), Canny edge maps (b), predicted contour from the previous frame (c), distance map generated from Scanny (d), matching between the predicted contour and the Scanny distance map (e)
4 Generating Level-of-Detail Canny Edge Maps and Contour Pixel Selection
Scanny Contour Pixel Selection. Nguyen [2] removed background edges using subject motion, but this approach leaves many irrelevant edges in the following cases: 1) an edge segment that has the same direction as the subject motion and a length that exceeds the length of the subject motion, and 2) inner edges of the tracked subject. These irrelevant edges prevent accurate contour tracking. We do not remove any background edges. Figure 2(a,b) shows an example of Scanny and Canny edge maps. Rather than removing background edges, we start with a Scanny edge map, as presented in Figure 2(a). By computing subject motion as others do [2], we obtain a predicted contour, as presented in Figure 2(c). We then generate a distance map of the Scanny edge map, as in Figure 2(d). Given a pixel on the predicted contour, we select the corresponding Scanny edge pixel, if one exists, by matching the predicted contour against the distance map of the Scanny edge map; we call this pixel a selected Scanny contour pixel. Figure 2(e) shows the best match with the reference contour pixel point (marked with a red cross); the green contour denotes the predicted contour, while black edge pixels denote Scanny edge pixels. From the matching along the predicted contour, we obtain a found list of selected Scanny contour pixels. The selected Scanny contour pixels are the reference pixels from which segments of the tracked contour are built, and they are stored in the selected Scanny found list.

Scanny Contour Pixel Connection. We route a path connecting adjacent selected Scanny contour pixels in the found list using LOD Canny edge pixels; adjacent here means adjacent in the found list. When every adjacent pair of selected Scanny contour pixels has been connected, we have a set of partial contours, although they are not guaranteed to be complete; complete means that the contour is four-neighbor connected and follows every possible Scanny edge. The selected Scanny pixels become the start and end nodes in routing, LOD Canny edge pixels become nodes, and the LOD values of adjacent edge pixels determine the routing costs between the nodes. In finding a partial contour, we find a best route that follows Canny edge pixels, favoring stronger ones; best means an optimal partial contour route built by taking the
strongest possible Canny edges, following the descending order of the LOD edge-map hierarchy. Our Canny edge tracing, which finds a route connecting selected Scanny contour pixels, is based on the concept of LOD. The LOD Canny edge maps consist of the various levels of Canny edge generation: Scanny edge pixels are assigned LOD value one, pixels that first appear in the next level LOD value two, and so on up to LOD value (N + 1); LOD value 255 is reserved for pixels with no edge. We take the part of the LOD Canny edge map around two adjacent selected Scanny contour pixels. Pixels of this LOD map become nodes, and we determine the weights between adjacent pixels from the Canny edge LOD value of each pixel, using the weight function described below. We favor traversing the simplest (strongest) edge pixels in the map and assign the lowest weight between two adjacent Scanny edge pixels to encourage Scanny-based routing. An LOD edge map is a finite set of pixels together with a mapping that assigns to each pixel an LOD value ranging from 1 to 255. An adjacency relation A is an irreflexive binary relation between pixels; here sAt holds exactly when s and t are four-connected neighbors in the LOD map. The LOD edge map can then be interpreted as a directed graph whose nodes are the LOD edge pixels and whose arcs are the pixel pairs in A. A path is a sequence of pixels in which consecutive pixels are adjacent; the first pixel is the origin and the last is the destination of the path. We assume a function that assigns to each path a cost in some totally ordered set of cost values containing a maximum element. The additive cost function satisfies

f(π · ⟨s, t⟩) = f(π) + w(s, t),

where π is any path ending at s and w(s, t) is a fixed nonnegative weight assigned to the arc ⟨s, t⟩.
This weight function penalizes routing from a pixel node with a low LOD value to one with a higher LOD value. If no edge pixel is present, the routing takes an ordinary (non-Canny-edge) pixel, with value 255, so that a closed contour can still be formed. The routing is done using Dijkstra's minimum-cost routing algorithm. We route a path connecting each adjacent pair of selected Scanny contour pixels in the found list. When all adjacent pairs have been connected, we obtain a set of partial contours, although they are not guaranteed to form a closed contour around the tracked subject.
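A minimal sketch of the partial-contour routing, assuming a plain Dijkstra search over a local LOD map with the destination pixel's LOD value as the arc weight (the paper's exact weight function is not reproduced here), could look as follows:

```python
# Hypothetical sketch of the partial-contour routing step (not the authors' code).
import heapq

def route(lod, start, goal):
    """Dijkstra over 4-connected pixels; lod[y][x] is in 1..255 (255 = no edge)."""
    h, w = len(lod), len(lod[0])
    dist = {start: 0}
    prev = {}
    heap = [(0, start)]
    while heap:
        d, (y, x) = heapq.heappop(heap)
        if (y, x) == goal:
            break
        if d > dist.get((y, x), float("inf")):
            continue
        for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
            if 0 <= ny < h and 0 <= nx < w:
                # Assumed weight: the LOD value of the destination pixel, so a
                # route stays cheap while it follows low-LOD (strong) edges.
                nd = d + lod[ny][nx]
                if nd < dist.get((ny, nx), float("inf")):
                    dist[(ny, nx)] = nd
                    prev[(ny, nx)] = (y, x)
                    heapq.heappush(heap, (nd, (ny, nx)))
    # Trace the path back from goal to start.
    path, node = [], goal
    while node != start:
        path.append(node)
        node = prev[node]
    path.append(start)
    return path[::-1]
```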
Complete Contour Build Up. To build a closed and complete contour for the current frame, we use Scanny edge maps around the predicted contour as well as the set of partial contours computed from the selected Scanny edge pixels. To obtain a globally best contour (four-neighbor connected, closed, and following every possible Scanny edge), we run a final routing using the computed basic contour and the Scanny edges around it. Global means that the entire contour is considered rather than only a part of the edge map. In computing the final contour, we consider only Scanny edge pixels rather than all LOD edge pixels, to reduce the number of nodes in the routing computation. The resulting contour becomes the contour of the current frame. For the final contour routing, the node set consists of the Scanny pixels together with the computed partial-contour pixels, and each node carries one of two values: Scanny edge pixels have value one, and computed partial-contour pixels have value two. The weight function for the final routing is defined on these values as follows:
We assign cost one between adjacent Scanny pixels and a higher cost between pixels of the computed basic contour. This has the effect of favoring Scanny edges over computed contour pixels. If no route made of Scanny pixels exists for some part of the edge map, the corresponding segment of the computed partial contour is selected.
5 Experimental Results
Experimental Environment. We have experimented with easily available video sequences, either downloaded from the Internet or generated with a home camcorder (SONY DCR-PC3). We generate 64 different LOD Canny edge maps, order them according to their number of Canny edge pixels, and take the union of the six simplest (top 10 percent) maps to form the Scanny edge map. Figure 3(a-e) shows a man walking in a subway hall. The hall tiles as well as his cross-striped shirt generate many complicated Canny edges. The tracked contour shape and color change as the man rotates from facing the front to facing the back while he approaches the camera and then moves away from it. There are many edge pixels in the background, the subject has many edges inside the tracked contour, and other people move in different directions in the background. To make tracking more difficult, the face color of the tracked subject is similar to the hall wall color, his shirt color is similar to that of the stairs, and the tracked subject's black hair is interfered with by other persons in Figure 3(b-e). Stair colors in Figure 3(b,e) are similar to the tracked subject's shirt color. Our tracked contour is disturbed by these interferences, but recovers as soon as we get Scanny edges back for the interfered part. Even under these complex circumstances, our boundary edge-based tracking was successful. Figure 3(f-j) shows tracking of a popular ping-pong sequence; the movie was downloaded from the Internet. It is not easy to track an object with a small
Fig. 3. Three tracking results
number of pixels. Tracking the high-speed ping-pong ball was successful until it is occluded by a player's hand in Figure 3(j). Figure 3(k-o) shows tracking of a man wearing a strongly textured shirt. Because this shirt generates many Scanny edges inside the tracked body, and these Scanny edges are connected, the contour tends to shrink, since our approach favors short edge routes. This side effect can be reduced by changing the routing cost function to preserve the previous contour shape.

Handling Occlusion. We assume our subject is never occluded by any background object, but that it may occlude other objects in the background. Our tracking conditions are tougher than the experimental environment of Nguyen [2]. A series of occlusions occurs in Figure 3(b-e). We suffer serious interference whenever similarly colored moving objects are occluded by the tracked subject. The hair color of a background woman is the same as that of the tracked subject, and the contour is disturbed as she moves right. The following bold-haired man interferes with the tracked subject more seriously, and the tracked contour is badly deformed because of his similar color. While the background object is near, the tracked contour remains heavily deformed; when the object moves away from the tracked subject, we get strong Canny edges back between the tracked subject and the background object. When the background subject is gone, there are still strong Canny edges generated by the wall tiles, but the contour produced by the wall tiles has similar colors inside and outside, so in the final step that erases Canny edges around wrongly tracked contour pixels, the edges caused by the wall tiles are erased. Because our contour routing favors short routes, the tracked contour successfully shrinks back to our tracked
subject within several tracking frames. Full tracking movies can be downloaded from http://www.cs.hongik.ac.kr/~jhpark.
6 Conclusion
In this paper, we proposed a new method for improving accuracy in tracking a highly textured subject. We start by selecting boundary edge pixels from the simple (strong) Canny edge map and then refer to progressively more detailed edge maps along the LOD Canny edge hierarchy. Our basic tracking frame is determined from the strong Canny edge map, and missing edges are filled in by the detailed Canny edges along the LOD hierarchy; this has the same effect as Nguyen's removal of noisy background edges. Another major contribution of our work is that tracking errors are not accumulated: we minimize the possibility of accumulated tracking error by relying only on the current Canny edge map. If no edge is present, we may have a tracking error for that part, but whenever Scanny edge information returns, the error disappears and accurate tracking resumes for the erroneous part. The limitation of our approach is that we need edge information, as every other edge-based approach does. If no edge information is available because the subject has the same color as the background, our tracking performance degrades heavily; this is inevitable for all such approaches, but our tracking performance recovers whenever edge information returns. With this method, our computation is not disturbed by noisy edges, resulting in robust tracking. Our experimental results show that our tracking approach is reliable enough to handle a sudden change of the tracked subject's shape in a complex scene.
References

1. Paragios, N., Deriche, R.: Geodesic active contours and level sets for the detection and tracking of moving objects. IEEE Trans. on Pattern Analysis and Machine Intelligence 22 (2000) 266–280
2. Nguyen, H.T., Worring, M., van den Boomgaard, R., Smeulders, A.W.M.: Tracking nonparameterized object contours in video. IEEE Trans. on Image Processing 11 (2002) 1081–1091
3. Roerdink, J.B.T.M., Meijster, A.: The watershed transform: Definition, algorithms and parallelization strategies. Fundamenta Informaticae 41 (2000) 187–228
4. Nguyen, H.T., Worring, M., van den Boomgaard, R.: Watersnakes: energy-driven watershed segmentation. IEEE Trans. on Pattern Analysis and Machine Intelligence 25 (2003) 330–342
5. Kass, M., Witkin, A., Terzopoulos, D.: Snakes: Active contour models. International Journal of Computer Vision 1 (1987) 321–331
6. Peterfreund, N.: Robust tracking of position and velocity with Kalman snakes. IEEE Trans. on Pattern Analysis and Machine Intelligence 21 (1999) 564–569
7. Fu, Y., Erdem, A.T., Tekalp, A.M.: Tracking visible boundary of objects using occlusion adaptive motion snake. IEEE Trans. on Image Processing 9 (2000) 2051–2060
This work was supported by 2004 IITA grant, contract no. 04-basic-104.
Object Boundary Edge Selection for Accurate Contour Tracking Using Multi-level Canny Edges

Tae-Yong Kim1, Jihun Park2, and Seong-Whan Lee1

1 Department of Computer Science and Engineering, Korea University, Seoul, Korea
{tykim,swlee}@image.korea.ac.kr
2 Department of Computer Engineering, Hongik University, Seoul, Korea
[email protected]
Abstract. We propose a method for selecting only the tracked subject's boundary edges in a video stream with a changing background and a moving camera. Our boundary edge selection is done in two steps: first, background edges are removed using edge motion; second, from the output of the previous step, boundary edges are selected using a derivative in the normal direction of the tracked contour. To remove background edges, we compute edge motions and object motion; edges whose motion direction differs from the subject motion are removed. To select boundary edges using the contour normal direction, we compute image gradient values at every edge pixel and select edge pixels with large gradient values. We use multi-level Canny edge maps to obtain a proper level of scene detail: detailed-level edge maps give us more scene information even when the tracked object boundary is not clear, because the detail level of the edge maps can be adjusted per scene. We use the Watersnake model to decide the new tracked contour. Our experimental results show that our approach is superior to Nguyen's.
1 Introduction and Related Works
Tracking moving objects (subjects) is an active research issue because of a wide variety of applications in computer vision such as video coding, video surveillance, and augmented reality. This paper addresses the problem of selecting boundary edges for robust contour tracking in a single video stream. We can classify the methods of representing an object contour into two categories: parameterized contours and nonparameterized contours. In tracking a parameterized contour, the contour representing the tracked subject is represented by parameters. These methods generally use Snake models [1]; the Kalman Snake [2] and the Adaptive Motion Snake [3] are popular examples. In the method of tracking a nonparameterized contour,
the object contour is represented directly as the object border, i.e., as a set of pixels. Paragios's algorithm [4] and Nguyen's algorithm [5] are popular in this category. Nguyen removed background edges using object motion, but Nguyen's approach leaves many irrelevant edges that prevent accurate contour tracking. To overcome this problem, this paper proposes a method that selects only the edges on the boundary of the tracked object. In order to increase contour tracking accuracy, we remove background edges using edge motions: the background edges whose motion directions differ from that of the tracked subject are removed. After background edge removal, we compute the average intensity gradient in the normal direction of the previous frame's contour and consider only the edges with high gradient values as boundary edges of the tracked object. We use multi-level Canny edges to obtain a proper level of scene detail. Thus, we can obtain robust contour tracking results even when the object boundary is not clear.
2 Efficient Contour Tracking
Nguyen [5] proposed a method for tracking a nonparameterized object contour in a single video stream. In that algorithm, a new tracked contour is determined by a watershed algorithm [6] with a watershed line smoothing energy [7,5] added to the energy minimization function; the new tracked contour is the border between the tracked object and the background areas. In the new-contour detection step, Nguyen used two edge indicator functions: one derived from the predicted contour and one computed from the edge map that remains after background edges are removed using the object motion vector. We instead create a boundary edge map from the edge map that remains after background edge removal by edge motion, and derive the second edge indicator function from this boundary edge map. We then use the two edge indicator functions to decide the new contour. Figure 1 shows an overview of our tracking method; the processes inside the dashed box denote our contributions.
Fig. 1. Overview of our tracking method.
3 Boundary Edge Selection
This section explains the method of selecting only boundary edges, improving the accuracy of object contour tracking.
3.1 Background Edge Removal
Nguyen [5] removed background edges using the object motion vector. However, Nguyen's approach leaves many irrelevant edges in the following cases: 1) an edge segment that has the same direction as the object motion and a length exceeding the magnitude of that motion, 2) a highly textured background, and 3) inner edges of the tracked object. These irrelevant edges prevent accurate contour tracking. We compute the tracked subject's motion and the background edge motions in order to remove background edges: the background edges whose motion directions differ from that of the tracked subject are removed. Edge motion is computed using optical flow [8]. We use the Canny edge generator for edge detection and compute optical flow on the edge map. The tracked subject's motion vector is compared against the computed motion vector of each edge pixel; if the difference between the two vectors is bigger than a specified constant, the pixel is considered a background edge pixel. Let the edge map detected at the current frame be given, together with the computed optical flow of each of its edge pixels. The dominant translation vector of the tracked subject is estimated over the velocity space from the optical flow of the pixels that belong to the object area in the previous frame. The background edge map collects the edge pixels whose motion disagrees with this dominant translation, and subtracting it from the current edge map yields an edge map without background edges. This background edge removal using edge motion removes edges with motion different from the tracked subject; it is independent of the degree of complexity of the edge map and accurately removes all background edge pixels with different motion. The edge map without background edges is then used for selecting the boundary edges.
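The following sketch is a hypothetical illustration of this edge-motion test (not the authors' code); it assumes OpenCV's pyramidal Lucas-Kanade tracker as the optical-flow estimator and an illustrative motion-difference threshold tau:

```python
# Hypothetical sketch of background edge removal by edge motion.
import cv2
import numpy as np

def remove_background_edges(prev_gray, cur_gray, cur_edges, v_obj, tau=2.0):
    """cur_edges: binary Canny map of the current frame; v_obj: subject motion (2-vector)."""
    ys, xs = np.nonzero(cur_edges)
    pts = np.float32(np.stack([xs, ys], axis=1)).reshape(-1, 1, 2)
    # Track current edge pixels back to the previous frame; (pts - nxt) then
    # approximates the forward motion of each edge pixel.
    nxt, status, _ = cv2.calcOpticalFlowPyrLK(cur_gray, prev_gray, pts, None)
    flow = (pts - nxt).reshape(-1, 2)
    keep = np.zeros_like(cur_edges)
    for (x, y), f, ok in zip(pts.reshape(-1, 2), flow, status.reshape(-1)):
        # Keep only edge pixels whose motion agrees with the subject motion.
        if ok and np.linalg.norm(f - v_obj) <= tau:
            keep[int(y), int(x)] = 255
    return keep
```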
3.2 Calculating an Image Gradient in a Contour Normal Direction
In this paper, we present a novel method of removing noisy edges by computing an image gradient using the previous frame's contour. We compute the image intensity gradient in the normal direction of the contour. Suppose the contour is represented parametrically by its arc-length parameter; the tangent direction of the contour at a point is the derivative of the contour with respect to this parameter, as presented in Figure 2(a), and the normal direction is orthogonal to it. We consider only the gradients of the image intensity function taken along this normal direction.
Extending this idea, we compute an average intensity gradient along the normal direction at each pixel point on the contour. The computation at a contour pixel is as follows: (i) form an ellipse whose two major axes are aligned with the tangent and normal directions, with its size adjusted appropriately; (ii) separate the pixels inside the ellipse into two parts using the line along the tangent direction in Figure 2(a); (iii) calculate the mean intensity values of the pixels in the two separated areas. The average gradient at the contour pixel is the difference between these two mean values.
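A minimal sketch of this normal-direction test, assuming a circular window instead of the ellipse and grayscale intensities (both simplifications of the description above), is:

```python
# Hypothetical sketch of the normal-direction gradient test (not the authors' code).
import numpy as np

def normal_gradient(gray, cx, cy, nx, ny, radius=5):
    """Difference of mean intensities on the two sides of the line through
    (cx, cy) perpendicular to the unit normal (nx, ny)."""
    side_a, side_b = [], []
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            if dx * dx + dy * dy > radius * radius:
                continue
            y, x = cy + dy, cx + dx
            if 0 <= y < gray.shape[0] and 0 <= x < gray.shape[1]:
                # The sign of the projection onto the normal decides the side.
                if dx * nx + dy * ny >= 0:
                    side_a.append(gray[y, x])
                else:
                    side_b.append(gray[y, x])
    if not side_a or not side_b:
        return 0.0
    return abs(float(np.mean(side_a)) - float(np.mean(side_b)))
```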
Fig. 2. (a) A normal direction of the parametric contour and an ellipse with two inside areas separated by the contour, used for calculating the average gradient. (b) The contour normal directions at the pixels used in the boundary edge selection.
3.3 Boundary Edge Pixel Selection
Boundary edge pixels are selected after background edge removal by edge motion. As explained in Section 3.1, the input is the edge map resulting from background edge removal by edge motion, together with a parametric representation of the predicted contour containing N pixels; the predicted contour is obtained by translating the previous frame's contour by the object motion vector. The selection process is carried out at every pixel point along the predicted contour. At each contour point we consider the part of the edge map inside a circular area centered at that point, with a radius given by a specified constant; the left side of Figure 4 shows the predicted contour and noisy edge pixels, and the right side shows a close-up of such a circular area. For every edge pixel inside the circular area, we compute the average gradient in the direction of the contour normal, in order to detect pixels with a large image intensity change that are likely boundary edges. For each contour point, we then collect these gradient values over the edge pixels of the circular area, taking the contour point as reference.
Fig. 3. Results of Canny edge detections in three different levels.
Figure 2(b) shows the contour normal directions at the pixels that belong to such a circular area. At each contour point, we select the k edge pixels with the largest gradient values as boundary edge pixels.
Figure 5 shows the process of selecting pixels for boundary edges. We compute a Canny edge map at each level; multi-level Canny edges are the results of Canny edge detection for different thresholds. Figure 3 shows the results of Canny edge detection at three different levels for a single image. We control the level of detail of a scene using these multi-level Canny edge maps: an overly detailed Canny edge map confuses our tracking, while a very simple edge map misses tracking information. The gradient computation described above is carried out at a chosen level, using the edges inside the circular area centered at the contour point. At the i-th computation loop, if the number of edge pixels inside the circular area is smaller than a specified constant, we use the edge map of the next, more detailed level. In other words, we use a more detailed Canny edge map when the object boundary is not clear, and therefore obtain robust tracking results even when the object boundary is not clear. At the i-th computation loop, we then select the edge pixels with large gradient values, as described above, from the chosen level among the computed Canny edge levels.
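A hypothetical sketch of this level fall-back rule (the window radius and the minimum pixel count are illustrative constants) is:

```python
# Hypothetical sketch: switch to a more detailed Canny level when the current
# level has too few edge pixels near a contour point.
import numpy as np

def local_edges(edge_levels, cx, cy, radius=7, min_pixels=10):
    """edge_levels: non-empty list of Canny maps ordered from simplest to most detailed."""
    for level, edges in enumerate(edge_levels):
        win = edges[max(0, cy - radius):cy + radius + 1,
                    max(0, cx - radius):cx + radius + 1]
        if np.count_nonzero(win) >= min_pixels:
            return level, win            # first level with enough local support
    return len(edge_levels) - 1, win     # fall back to the most detailed map
```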
Fig. 4. A predicted contour and image operations along the contour. The operation is done on every edge pixel in a circular area.
Fig. 5. The process of boundary edge selection.
4 Contour Tracking with Selected Boundary Edges
An overview of our tracking process is shown in Figure 1. The user inputs an initial contour of the tracked object in the first frame. In the contour detection step, the watershed segmentation is carried out by an energy minimization based on the concept of topographical distance [5,7]. For this algorithm, we use two edge indicator functions: one derived from the predicted contour and one derived from the boundary edge map. An algorithm for computing the edge indicator function is given in Nguyen's paper [5]. The boundary edge map is obtained by the algorithm proposed in this paper, and the predicted contour is obtained by translating the contour of the previous image frame by the object motion vector. The watershed line extracted using the two edge indicator functions becomes the new contour for the current frame.
5 Experimental Results
Figure 6 shows images with background edges removed by Nguyen's approach [5] and boundary edges selected by our approach. Figure 7 shows the contour tracking results for a movie clip; the boundary edges in Figure 6 and Figure 7 were selected with the parameter settings described above. The output of background edge removal by Nguyen's approach leaves many irrelevant edges, as shown in Figure 6, which prevent accurate contour tracking. Figure 7 shows the tracking results in a subway hall. The hall tiles as well as the man's cross-striped shirt generate many complicated Canny edges. The contour shape changes as the man with the cross-striped shirt rotates from facing the
front to the back. The size of the tracked subject changes as the man comes closer to the camera and then moves away from it. There are many edge pixels in the background, the subject has many edges inside the tracked contour, and other people move in different directions in the background. Under these complex circumstances, Figure 7(a-h) shows that our boundary edge-based tracking was more successful than Nguyen's (Figure 7(i-p)). Walking people crossing our subject did not affect our tracking performance. A full tracking movie can be downloaded from http://www.cs.hongik.ac.kr/~jhpark/tykim0324mpg.avi
Fig. 6. (a) Two consecutive frames and a contour determined at the previous frame (marked by a white outline). (b) An output of background edge removal by Nguyen's approach. (c) Outputs of boundary edge selections with k = 1, 3, 5.
6 Conclusion
In this paper, we proposed a novel method for improving accuracy in tracking the contour of a highly textured object. We select only the edges around the tracked object's boundary to overcome the noisy-edge problem caused by a complex scene. To remove background edges using edge motion, we compute the tracked subject's motion and the edge motions, and remove the edges whose motion direction differs from the subject motion. Then, we compute the image intensity gradient in the normal direction of the previous frame's contour to remove redundant edges from the edge map resulting from the background edge removal. We can obtain robust contour tracking results, even when the object boundary is not clear, by using multi-level Canny edges generated with a variety of Gaussian parameters. By considering only the normal direction of the contour, we ignore edges whose slope differs from that of the subject boundary. The gradient computation of the average intensity change captures the change of the textured areas divided by the contour. With these methods, our computation is not disturbed by noisy edges or small cross-stripe textures, resulting in robust contour tracking. Our experimental results show that our contour tracking approach is reliable enough to handle a sudden change of the tracked subject's shape in a complex scene, and that our boundary edge-based tracking is more successful than Nguyen's approach.
This work was supported in part by 2004 IITA grant, contract no. 04-basic-104.
Object Boundary Edge Selection for Accurate Contour Tracking
543
Fig. 7. Comparison of tracking results (superimposed by a black outline)
References

1. Kass, M., Witkin, A., Terzopoulos, D.: Snakes: Active contour models. International Journal of Computer Vision 1 (1987) 321–331
2. Peterfreund, N.: Robust tracking of position and velocity with Kalman snakes. IEEE Trans. on Pattern Analysis and Machine Intelligence 21 (1999) 564–569
3. Fu, Y., Erdem, A.T., Tekalp, A.M.: Tracking visible boundary of objects using occlusion adaptive motion snake. IEEE Trans. on Image Processing 9 (2000) 2051–2060
4. Paragios, N., Deriche, R.: Geodesic active contours and level sets for the detection and tracking of moving objects. IEEE Trans. on Pattern Analysis and Machine Intelligence 22 (2000) 266–280
5. Nguyen, H.T., Worring, M., van den Boomgaard, R., Smeulders, A.W.M.: Tracking nonparameterized object contours in video. IEEE Trans. on Image Processing 11 (2002) 1081–1091
6. Roerdink, J.B.T.M., Meijster, A.: The watershed transform: Definition, algorithms and parallelization strategies. Fundamenta Informaticae 41 (2000) 187–228
7. Nguyen, H.T., Worring, M., van den Boomgaard, R.: Watersnakes: energy-driven watershed segmentation. IEEE Trans. on Pattern Analysis and Machine Intelligence 25 (2003) 330–342
8. Shi, J., Tomasi, C.: Good features to track. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (1994) 593–600
Reliable Dual-Band Based Contour Detection: A Double Dynamic Programming Approach

Mohammad Dawood1,2, Xiaoyi Jiang1, and Klaus P. Schäfers2

1 Department of Computer Science, University of Münster, Einsteinstr. 62, 48149 Münster, Germany
[email protected]
2 Department of Nuclear Medicine, University Hospital of Münster, Albert-Schweizer-Str. 33, 48149 Münster, Germany
{dawood,schafkl}@uni-muenster.de
Abstract. Finding contours in constrained search space is a well known problem. It is encountered in such areas as tracking objects in videos, or finding objects within defined boundaries. One main concern is to restrict the search space. Different solutions to this problem have been proposed. Presented in this paper is a double dynamic programming approach which is both optimal and computationally fast. It employs dynamic programming for finding correspondence between pairs of pixels on the inner and outer boundary of a band, which is found through morphological transforms and restricts the search space. In a second step dynamic programming is used again to find the exact contour inside this restricted search space.

Keywords: Contour Detection, Search Space, Dynamic Programming, Object Tracking, Dual Band
1
Introduction
Finding contours in restricted search space is an important improvement on the more general problem of finding contours, because it reduces the computational cost significantly. Reducing the search space also makes the contour detection more reliable. This is the case in such applications as tracking the motion of an object in a video, matching the boundaries of objects on an image with a template, segmenting objects in medical images where a rough contour is given, etc. In these cases the object(s) whose contour is to be found moves or deforms over a sequence of images, whereby the motion or deformation between two consecutive images is comparatively small. The contours of the object are thus in the neighbourhood of the initial contour found in a previous image or given as a template. Thus it is computationally rewarding to restrict the search space for finding the contours to this neighbourhood only. Different approaches have been proposed to restrict the search space. Our idea is to form a dual-band around a known or initial contour and find the contour in the target image within this band. Mostly one or another form of active contours, also called snakes, are used to form this dual-band (see [2] for
a few examples). Almost all of these methods require manual initialization of both the internal and external limits of the dual-band, as in the double active contours approach, or they use one manually initialized contour and search a fixed number of pixels to its inside and outside. A simple method of restricting the search space via dual active contours (snakes) was proposed by Gunn et al. [7]. They used two snakes at the same time: one is set inside the target object and expands, whereas the other is set outside the object and contracts. The two snakes are interlinked, using the arc length, to provide a driving force that can overcome local minima, a problem faced by many snake implementations. The inner and outer snakes thus form the inner and outer boundaries of the search space. Another approach, proposed by Giraldi et al. [6], is that of dual topologically adaptable snakes. These snakes not only take the ordinary energy functions into account but also depend on the topology of the surface they lie on. As in all dual snake models, one snake expands from inside the object and the other contracts from the outside; both are linked through a function that allows them to overcome local minima. A different approach was implemented by Georgoulas et al. [5]. For the problem of finding the center of a cylinder, they constructed two circles centered in the middle of the image, thereby restricting the search space. Equal numbers of selected pixels on both circles were connected by straight lines, and the snake was only allowed to move along these lines. This method is very application specific: it can only be applied to circular structures, and it presumes that the object is in the middle of the image. Another method was proposed by Aboutanos et al. [1] and recommended by Dawant et al. [3]. In this case a basic contour is first marked manually and normals to this contour are constructed; the length of the normals restricts the search space. The pixels along the normals are transferred to a polar coordinate system and the exact contour is then found with the help of dynamic programming. The problem with this method is that it can only be applied to objects that are somewhat convex: at sharp corners or pointed peaks the normals cross each other, and contour finding may fail there. A variant of this is suggested by Erdem et al. [4]: they segment the initial object and use its contour, with normals defining the limits of the search space, and then find the contour with the help of snakes. Presented in this paper is a new method to restrict the search space to a dual-band of predefined width. The inner and outer boundaries of the search space are interlinked with straight lines that are guaranteed not to cross each other, which allows complex shapes that cannot be handled so easily by taking normals. The actual boundary is then detected inside this search space. For both steps dynamic programming is used, which gives a fast and optimal solution.
Fig. 1. An object and its contour.
2 Our Approach
In this section we explain the basic algorithms and their application to the problem of finding contours in a restricted search space. The section is divided into the following subsections. First, the method used to restrict the search space is described, i.e., how the dual-band is defined. Second, the method of finding correspondences between the outer and inner boundaries of the dual-band is explained; this is done through dynamic programming. Third, the method of finding the object in the restricted search space is given, which is again a dynamic programming algorithm.
2.1 Restricting the Search Space
The search space is restricted by the use of a dual-band formed around the initial contour: a fixed number of pixels to the inner and outer side of the initial contour gives the band in which the object will be sought in the target image. The initial contour can be found by any contour detection method; the method may differ according to the needs of the particular application and has no effect on the rest of the procedure proposed here, since the initial contour is only used as a rough estimate of the object's position. We used simple thresholding and labeling in the example given in figure 1 to find the initial contour. All pixels around this initial contour up to some user-defined distance form the dual-band; the width of the band is chosen so that the object in the target image lies inside this search space. The Euclidean distance transform of the initial contour is a good and efficient way of doing this: the inner and outer boundaries of the dual-band are the isocontours at the chosen pixel distances from the initial contour, to the inside and to the outside respectively. The dual-band thus formed from the initial contour in figure 1 is shown in figure 2.
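As an illustration, a minimal sketch of the dual-band construction, assuming SciPy's Euclidean distance transform and illustrative band widths d_in and d_out, is:

```python
# Hypothetical sketch of the dual-band construction (not the authors' code).
import numpy as np
from scipy.ndimage import distance_transform_edt

def dual_band(object_mask, d_in=5, d_out=5):
    """object_mask: boolean array, True inside the initial object.
    Returns the band of pixels within d_out outside and d_in inside the
    initial boundary, plus approximate inner/outer band boundaries."""
    outside = distance_transform_edt(~object_mask)   # distance to the object
    inside = distance_transform_edt(object_mask)     # distance to the background
    band = (outside <= d_out) & (inside <= d_in)
    inner_boundary = object_mask & (inside >= d_in) & (inside < d_in + 1.5)
    outer_boundary = (~object_mask) & (outside >= d_out) & (outside < d_out + 1.5)
    return band, inner_boundary, outer_boundary
```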
Fig. 2. The dual-band which defines the search space.
2.2 Correspondence Between the Inner and Outer Boundaries of the Search Space
Now that we have the dual-band, the next step is to find the correspondence between the pixels on its inner and outer boundaries. This is required for boundary tracking algorithms, e.g. the dynamic programming algorithm, which need a sequence of pixels. A simple approach is to use the normal vectors to an initial contour, as done in [1] and [2], but this is inaccurate as it may lead to criss-crossing of the normals, which results in loops while tracking the target contour. Our idea is to interconnect the pixels in such a way that the connecting lines do not cross each other. We utilize the dynamic programming technique for this purpose: each pixel is connected to a counterpart on the other side of the dual-band such that the total length of all connecting lines is minimized over all possibilities. As dynamic programming works with a sequence of pixels, it automatically avoids the crossing-of-lines problem, see figure 3. Technical realization. The boundary pixels are stored in two contour lists, one each for the inner and outer boundary of the dual-band; the list entry at position n contains the x and y coordinates of the n-th pixel of the boundary. Since the contours are closed and the lists therefore circular, start points must be selected: both contour lists are rotated so that their start positions have the shortest possible distance. If more than one such pair exists, it is sufficient to take any one of them, e.g. the first. Dynamic programming is now applied to find the best correspondence between the pixels of the two lists. The cost function we seek to minimize via dynamic programming is the sum of the distances between all corresponding pairs of the band; the global solution therefore gives the set of correspondences between pixels on the inner and outer boundaries with the smallest possible sum of distances. Any distance measure can be used; we have used the Euclidean distance. It should be remembered, however, that we are only trying to find a correspondence between the two boundaries of the band, so the exact method of correspondence is not crucial to the second step of finding the actual contour.
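A minimal sketch of this correspondence step, written as a DTW-style dynamic programme over the two rotated pixel lists (an assumption about the exact recursion, not the authors' implementation), is:

```python
# Hypothetical sketch: monotone (non-crossing) alignment of the inner and outer
# boundary pixel lists minimizing the summed Euclidean distance.
import numpy as np

def align_boundaries(inner, outer):
    """inner, outer: (m, 2) and (n, 2) arrays of (x, y) pixels, already rotated so
    that the two closest start points sit at index 0."""
    m, n = len(inner), len(outer)
    d = np.linalg.norm(inner[:, None, :] - outer[None, :, :], axis=2)
    cost = np.full((m, n), np.inf)
    cost[0, 0] = d[0, 0]
    for i in range(m):
        for j in range(n):
            if i == j == 0:
                continue
            prev = min(cost[i - 1, j] if i else np.inf,          # advance inner
                       cost[i, j - 1] if j else np.inf,          # advance outer
                       cost[i - 1, j - 1] if i and j else np.inf)
            cost[i, j] = d[i, j] + prev
    # Backtrack to recover the monotone pairs.
    pairs, i, j = [], m - 1, n - 1
    while i or j:
        pairs.append((i, j))
        moves = [(cost[i - 1, j - 1] if i and j else np.inf, i - 1, j - 1),
                 (cost[i - 1, j] if i else np.inf, i - 1, j),
                 (cost[i, j - 1] if j else np.inf, i, j - 1)]
        _, i, j = min(moves)
    pairs.append((0, 0))
    return pairs[::-1]
```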
Fig. 3. The correspondence between the inner and outer boundary pixels of the dual-band. Only every third pair is shown for the sake of visibility.
2.3
Finding the Object in Restricted Search Space
The second part of this method is the detection of the contour in the restricted search space obtained from step one. The pixels in the dual-band are rearranged in matrix form, suitable for the dynamic programming contour detection algorithm: each straight line of the correspondence shown in figure 3 forms one row of the matrix. An appropriate cost function is then defined on this matrix, and the dynamic programming algorithm finds the most cost-effective path. The dynamic programming algorithm is based on the idea that if a path from 'a' to 'c' is optimal and passes through 'b', then each of the paths from 'a' to 'b' and from 'b' to 'c' is also optimal. In this way sequentially optimal paths can be calculated from the top to the bottom of a matrix: the costs at each step are added to the already visited positions of the matrix, and the least cost in the bottom row is traced back to the top, giving the optimal result. See [3] for a detailed description of the algorithm. The definition of the cost function is very important but also very application specific; therefore a general function for boundary detection is not given here. Depending on the available data, different functions can be conceived, e.g. a six-term cost function was used in [1]. We have used as cost the sum of a term derived from the grayscale of the target image and a distance term. The latter is a function that gives the distance of any point of the image from the closest edge point; this term pushes the boundary towards the edges of the target object. Both terms are weighted 1:7 respectively, i.e. c(x, y) = g(x, y) + 7 DT(x, y), where g(x, y) is the histogram-equalized grayscale value of the target image at position (x, y) and DT is the distance transform function. We have used the Sobel
Fig. 4. The target image and the dual-band imposed on it. The target contour has been successfully detected in the right most image.
operator as edge detector. Dynamic programming is then used for detecting the best contour in this matrix. The result is shown in figure 4.
3
Preliminary Experimental Results
The results of the algorithm on a different sequence of images from the same video are presented in figure 5 to show how it tracks the object through the images.
Fig. 5. A series of frames with detected contours.
As the method is universal in character it can also be used to segment images from different modalities such as PET (Positron Emission Tomography) and
Fig. 6. An example from medical imaging. Left to right: CT image, PET image, Lungs segmented on CT image, lungs segmented on PET images.
CT images used in medical imaging. An experiment to this effect was done for segmenting lungs in PET/CT images and its results are shown in figure 6. The PET images are acquired over a period of typically 30-40 minutes. Due to the breathing motion during this time and difference in the mode of acquisation, there is no full spatial correspondence between the PET and CT images. We first segmented the lungs on the CT images, which are far superior in quality to the PET images for segmentation purposes as they are density-based images whereas PET images are function-based, and then used these contours to segment the lungs on the PET images. The images for initial contour must not be of very good quality as they are only used for initialisation and as a rough estimate.
4 Conclusions
We have presented an efficient method of restricting the search space and finding contours in it, for a large range of applications. The method is based on the dynamic programming technique and gives globally optimal results. Furthermore, it is a fast and stable method: the cost of the dynamic programming algorithm on an m x n matrix is O(mn), and restricting the search space to the pixels in the dual-band makes it more efficient, reliable and accurate. The algorithm is non-iterative and deterministic. Our method can be used for tracking objects in video sequences or for segmentation when a rough contour is known, such as in medical images. The effectiveness and validity of the method were demonstrated on real-life images. The method of finding correspondence between the boundary pixels of the dual-band is applicable to highly complex shapes, as demonstrated in figure 7, and is successful at avoiding any criss-crossing of the correspondence lines. The dual-band can be defined in different ways, such as by translation, rotation, affine transformation, or, as in this example, the Euclidean
Fig. 7. An example of finding correspondence between the contour boundaries in a complex shape. Every fourth pair is shown for better visibility.
distance. The definition of the dual-band can thus be selected in accordance with the application and further increase the efficiency of the algorithm. Besides its use in contour detection through dynamic programming, our method of defining the dual-band suggests itself as a way of initialising snakes in dual-snake approaches: the correspondence-finding algorithm provides an efficient way of linking the inner and outer snakes, and automatic linking of the corresponding snake points is possible this way. Future research should be directed at extending this approach to 3D images. Acknowledgements. The authors want to thank K. Rothaus and S. Wachenfeld for valuable discussions.
References

1. G B Aboutanos, J Nikanne, N Watkins and B M Dawant: Model Creation and Deformation for the Automatic Segmentation of the Brain in MR Images. IEEE Transactions on Biomedical Engineering, 1999, Vol 46(11), pp 1346-1356.
2. A Blake and M Isard: Active Contours. Springer, London 1998.
3. B Dawant and A P Zijdenbos: Image Segmentation, in Handbook of Medical Imaging, Vol 2, Medical Image Processing and Analysis, 2000, pp 71-127.
4. C E Erdem, A M Tekalp and B Sankur: Video Object Tracking with Feedback of Performance Measures. IEEE Transactions on Circuits and Systems for Video Technology, 2003, Vol 13(4), pp 310-324.
5. G Georgoulas, G Nikolakopoulos, Y Koutroulis, A Tzes and P Groumpos: An Intelligent Visual-Based System for Object Inspection and Welding, Relying on Active Contour Models-Algorithms. Proceedings of the 2nd Hellenic Conference on AI, SETN April 2002, Thessaloniki, Greece, Companion Volume, pp 399-410.
6. G A Giraldi, L M G Gonçalves, and Antonio A F Oliveira: Dual Topologically Adaptable Snakes. In Proceedings of CVPRIP'2000 International Conference on Computer Vision, Pattern Recognition and Image Processing, Atlantic City, USA, February 2000, pp 103-107.
7. S R Gunn and M S Nixon: A Robust Snake Implementation: A Dual Active Contour. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1997, Vol 19(1), pp 63-68.
Tracking Pedestrians Under Occlusion Using Multiple Cameras*

Jorge P. Batista

ISR-Institute of Systems and Robotics, DEEC/FCT, University of Coimbra, Coimbra, Portugal
[email protected]
Abstract. This paper presents an integrated solution for tracking multiple non-rigid objects (pedestrians) in a multiple-camera system, with ground-plane trajectory prediction and occlusion modelling. The resulting system is able to maintain the tracking of pedestrians before, during and after occlusion. Pedestrians are detected and segmented using a dynamic background model combined with motion detection and brightness and color distortion analysis. Two levels of tracking have been implemented: image-level tracking and ground-plane-level tracking. Several target cues are used to disambiguate between possible candidates of correspondence in the tracking process: spatial and temporal estimation, color and object height. A simple and robust solution for image occlusion monitoring and grouping management is described. Experiments in tracking multiple pedestrians in a dual-camera setup with a common field of view are presented.
1 Introduction
Tracking people in relatively unconstrained, cluttered environments as they form groups and part from one another requires robust methods that cope with the varied motions of the humans, occlusions, and changes in illumination. When occlusion is minimal, a single camera may be sufficient to reliably detect and track objects, although, in most cases, robust tracking of multiple people through occlusions requires human models to disambiguate occlusions [11]. However, when the density of objects is high, the resulting occlusion and lack of visibility require the use of multiple cameras and cooperation between them, so that the objects are detected using information available from all the cameras covering a surveillance area [9,13]. The approach described in this paper explores the combination of multiple cameras to solve the problem of autonomously detecting and tracking multiple people in a surveillance area. Since no a priori model of people is available, the paper presents a tracking method based on appearance: tracking the perception of people's movements instead of tracking their real structure. An improved image tracking mechanism that combines image segmentation and recursive trajectory estimation is proposed. The recursive approach is used to feed back into the
* FCT project POSI/SRI/34409/1999
image tracking level the ground-plane predicted target information. The integration of this information in the image tracking level enables robust tracking of multiple pedestrians, helping to disambiguate temporary occlusions (people crossing, forming and leaving groups) as well as permanent occlusion situations (people standing behind closets). The ground-plane pedestrian trajectory prediction is obtained by fusing the information supplied by the multiple cameras and managing people's grouping. Several target cues are used to disambiguate between possible candidates of correspondence in the tracking processes: spatial and temporal estimation, color and object height. Experiments in tracking multiple objects in a dual-camera setup with a common field of view are presented; an accurate ground-plane trajectory prediction is obtained under several types of occlusion.
2 Multiple Target Detection
In this system, only moving objects are considered as targets (pedestrians). As the camera (sensor node) is fixed, target detection is based on a combination of motion detection and brightness and chromaticity distortion analysis. This approach allows a robust segmentation of shading background from the ordinary background or moving foreground objects. The background image model is regularly updated to compensate for illumination change and to include or remove in the background model the objects that stopped or started their movement in the field of view of the camera.
2.1 Moving Target Segmentation
Each pixel in a new image is classified as one of background (B), object (O), shadow (S) or highlight (H), and ghost (G). The clustering of foreground pixels is based on the following validation rules:

Object: (foreground pixel) & [~(shadow/highlight)] & (in motion)
Shadow/Highlight: (foreground pixel) & (shadow/highlight)
Ghost: (foreground pixel) & [~(shadow/highlight)] & [~(in motion)]

The distinction between objects, shadows and highlights among pixels not classified as background is made using the brightness and chromaticity distortion [4]. Taking the expected background pixel's RGB value as reference, the brightness distortion is the scalar factor that brings the observed color closest to the expected chromaticity line; it is obtained by minimizing the distance between the observed color and the scaled expected background color, and it represents the pixel's strength of brightness with respect to the expected value. The color (chromaticity) distortion of a pixel is defined as the orthogonal distance between the observed color and the expected chromaticity line. Applying suitable thresholds to the normalized brightness distortion and the normalized chromaticity distortion of a pixel yields the object mask.
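A hypothetical sketch of this classification, assuming the Horprasert-style normalized distortions cited as [4] and illustrative thresholds, is:

```python
# Hypothetical sketch of per-pixel brightness/color distortion and classification.
import numpy as np

def distortions(frame, mean_bg, std_bg, eps=1e-6):
    """frame, mean_bg, std_bg: float arrays of shape (H, W, 3) in RGB."""
    e = mean_bg / (std_bg + eps)                 # expected color, noise-normalized
    i = frame / (std_bg + eps)                   # observed color, noise-normalized
    alpha = (i * e).sum(axis=2) / ((e * e).sum(axis=2) + eps)   # brightness distortion
    cd = np.linalg.norm(i - alpha[..., None] * e, axis=2)        # color distortion
    return alpha, cd

def classify(alpha, cd, moving, t_cd=6.0, t_lo=0.6, t_hi=1.4):
    """moving: boolean mask from the three-frame difference rule."""
    fg = cd > t_cd                                # chromaticity differs: foreground
    shadow = (~fg) & (alpha < t_lo)               # darker, same chromaticity
    highlight = (~fg) & (alpha > t_hi)            # brighter, same chromaticity
    obj = fg & moving                             # foreground and in motion
    ghost = fg & (~moving)                        # foreground but static
    return obj, shadow | highlight, ghost
```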
554
J.P. Batista
The three-frame difference rule suggests that a pixel in the current frame is moving if its intensity has changed significantly both between the current and the last image and between the current and the next-to-last image. Based on this rule, an image mask of moving pixels is created, and the moving pixels are clustered into connected regions, defining a bounding box per region. Each pixel of the background image is modelled by a multidimensional Gaussian distribution in RGB space (mean and standard deviation). These parameters are updated with each new frame using a linear filter of the form μ(t) = (1 - α) μ(t-1) + α I(t), the standard deviation being updated with the same exponential rule, with α
being used to control the rate of adaptation A critical situation occurs whenever objects stop their movement for a period or when objects modelled as being part of the background start moving. To deal with this situation, each pixel has a state transition map defining a dynamic pixel rate of adaptation. The state transition map will encode in all the moving object pixels the elapsed time since the beginning of the object movement. Different rates of adaptation are used according to
where
is the elapsed time since the target stopped its movement and being
the width and height of the bounding box respectively,
the frame rate and the image velocity components of the bounding box center of mass. Figure 1 shows the result of the target detection process in one of the static camera nodes.
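As an illustration of the segmentation rules above, the following sketch classifies pixels using brightness/chromaticity distortion and a three-frame motion mask. It is a minimal reading of Section 2.1 in the spirit of the cited Horprasert background model [4]; the threshold names and values (tau_alpha, tau_cd, tau_motion) are assumptions, not the paper's settings.

```python
import numpy as np

def classify_pixels(frame, bg_mean, bg_std, prev, prev2,
                    tau_alpha=0.15, tau_cd=8.0, tau_motion=15.0):
    """frame, prev, prev2: current and two previous RGB frames (H x W x 3, float).
    bg_mean, bg_std: per-pixel background mean and std (H x W x 3)."""
    eps = 1e-6
    # Brightness distortion alpha: scale that moves the observation onto
    # the expected chromaticity line (weighted least squares along bg_mean).
    num = np.sum(frame * bg_mean / (bg_std**2 + eps), axis=2)
    den = np.sum((bg_mean / (bg_std + eps))**2, axis=2) + eps
    alpha = num / den
    # Chromaticity distortion: orthogonal distance to the chromaticity line.
    cd = np.sqrt(np.sum(((frame - alpha[..., None] * bg_mean) /
                         (bg_std + eps))**2, axis=2))

    chroma_off = cd > tau_cd
    bright_off = np.abs(alpha - 1.0) > tau_alpha
    foreground = chroma_off | bright_off
    sh = (~chroma_off) & bright_off        # colour matches, only brightness differs

    # Three-frame difference: moving if changed w.r.t. both previous frames.
    gray = lambda im: im.mean(axis=2)
    moving = (np.abs(gray(frame) - gray(prev)) > tau_motion) & \
             (np.abs(gray(frame) - gray(prev2)) > tau_motion)

    labels = np.full(frame.shape[:2], 'B', dtype='<U1')   # background
    labels[foreground & sh] = 'S'                          # shadow / highlight
    labels[foreground & ~sh & moving] = 'O'                # object
    labels[foreground & ~sh & ~moving] = 'G'               # ghost
    return labels, moving
```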
2.2 Image Target Model
The target model adopted is composed of three primitives: the image coordinates of the point of contact of the pedestrian with the ground plane, the image coordinates of the head of the pedestrian, and the width of the bounding box measured at the center of mass. Assuming an upright walking posture for pedestrians, the head and feet coordinate pairs are defined as the intersections of the line passing through the bounding box center of mass and the image vanishing point of the vertical posture with the top and bottom lines of the bounding box (Fig. 1). Two additional target cues are used: the color information associated with the blob and the estimated 3D height of the target. A target descriptor therefore combines the feet and head image coordinates, the bounding box width, the color histogram of the target's blob, and the number of points of the segmented blob.
Fig. 1. Target segmentation with shadow detection (left) & Image target model (right).
2.3 Target Color Model
Color distributions have been effectively modelled for tracking using both color histograms and Gaussian mixture models. Although both approaches perform well, color histograms were adopted here. To avoid problems due to changing light intensity, a simple color constancy algorithm is used that normalizes the R, G, B color components. Each normalized color component is quantized into 64 values (6 bits), yielding a total of 4096 histogram bins. A histogram simply counts the number of occurrences of each quantized color within the detected blob of a person. Histogram models are adaptively updated by storing the histograms as probability distributions and blending them with the probability distribution obtained from the current image. Given a pair of histograms for the target and the model, each containing the same number of bins, the normalized histogram intersection is obtained by intersecting the discrete probability distributions of the model and the target [7,12]. Color information is also used in the master node to disambiguate between possible candidates in the matching process. A simple color histogram union is used to combine the color information of a target obtained from the sensor nodes: the normalized color histogram of the target in the master node is built from the normalized color histograms of the target obtained by each sensor node, and is adaptively updated using the same approach proposed for the sensor nodes.
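The sketch below shows one possible reading of this color model: the 4096 bins are interpreted as a 64x64 histogram over the two normalized chromaticities (r, g), which is an assumption on my part, as is the blending rate rho. The intersection measure follows Swain and Ballard [7].

```python
import numpy as np

def normalized_rg_histogram(pixels, bins=64):
    """pixels: N x 3 array of RGB values belonging to the segmented blob."""
    s = pixels.sum(axis=1, keepdims=True) + 1e-6
    rg = pixels[:, :2] / s                       # simple colour-constancy normalization
    idx = np.minimum((rg * bins).astype(int), bins - 1)
    hist = np.zeros((bins, bins))
    np.add.at(hist, (idx[:, 0], idx[:, 1]), 1.0)
    return hist / hist.sum()                     # stored as a probability distribution

def update_model(model, current, rho=0.1):
    """Adaptive update H <- (1 - rho) * H + rho * H_current (rho is assumed)."""
    return (1.0 - rho) * model + rho * current

def histogram_intersection(h1, h2):
    """Normalized intersection of two probability-distribution histograms."""
    return np.minimum(h1, h2).sum()
```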
2.4 Target Height Model
Target height is modelled using the image-to-ground-plane homography transformation of each sensor node (Section 4). Given the ground-plane location of a sensor node, the height of a target is defined as a function of the ground-plane location of the target and the image coordinates of its head and feet. Conversely, knowing the height and the ground-plane location of a pedestrian, the image projections of his head and feet can be modelled, and the coordinates of the head and feet image projections are obtained through the homography.
3 Single-View Tracking
Single-view tracking aims to track at the image level all moving targets detected and segmented by the image processing level. The target state vector comprises the positions and velocities of the model target feature points and the width of the bounding box. The system is described by a discrete model in which the state evolves with additive discrete-time white noise of zero mean and covariance matrix Q, the measurements are corrupted by discrete-time white noise of zero mean and covariance matrix R, and the two noise processes are uncorrelated for all time instants. We adopt the assumption that trajectories are locally linear in 2D and that the width of the bounding box changes linearly, so the system model reduces to a linear difference equation whose evolution matrix is based on first-order Newtonian dynamics and assumed time invariant. The measurement vector contains the measured feature point positions and the bounding box width, and is related to the state vector via the measurement equation.
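A minimal sketch of such a tracker is given below: a constant-velocity Kalman filter over the measured scalars (feet point, head point and bounding-box width). The noise levels q and r and the per-coordinate block structure are assumptions, not the paper's tuning.

```python
import numpy as np

def make_cv_kalman(n_coords, dt=1.0, q=1.0, r=4.0):
    """Constant-velocity Kalman matrices for n_coords tracked scalars
    (e.g. feet u,v, head u,v and bounding-box width -> n_coords = 5)."""
    F1 = np.array([[1.0, dt], [0.0, 1.0]])        # [position, velocity] block
    H1 = np.array([[1.0, 0.0]])                   # only the position is measured
    F = np.kron(np.eye(n_coords), F1)
    H = np.kron(np.eye(n_coords), H1)
    Q = q * np.eye(2 * n_coords)
    R = r * np.eye(n_coords)
    return F, H, Q, R

def kf_step(x, P, z, F, H, Q, R):
    """One predict/update cycle; z is the vector of measured positions."""
    x_pred = F @ x
    P_pred = F @ P @ F.T + Q
    S = H @ P_pred @ H.T + R
    K = P_pred @ H.T @ np.linalg.inv(S)
    x_new = x_pred + K @ (z - H @ x_pred)
    P_new = (np.eye(len(x)) - K @ H) @ P_pred
    return x_new, P_new
```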
3.1 Image Occlusion and Grouping Management
At this stage it is important to define the concept of an object. An object represents an image tracked target and can be of a single or compound nature. It is
represented by a descriptor that contains the object description, the tracker parameters, and the number of targets associated with the object, together with a list of pointers to the object descriptors that form a compound object. To disambiguate between possible correspondence candidates in the tracking process, two image cues are used: spatial and temporal estimation, and color. The histogram color matching and the bounding box overlapping ratio are used to build correspondence matrices (CMat) between the a posteriori estimated image positions of the objects and the detected targets for each time frame; the overlapping ratio relates the areas of the bounding boxes and their overlapping area.
A unitary value in the bottom row represents a correspondence between an object and a target, values greater than one indicate object merges, and null values indicate newly detected targets. The last column of the matrix indicates an object split for values greater than one, a correspondence for unitary values, and the loss of an object for null values. Based on the correspondence matrices, four managers, running in cascade, handle the image objects: split manager, merge manager, new/lost manager and update manager. The Split manager- When a split situation is detected, two situations can occur: a compound object split (the most common case) or a single object split. The latter can happen when a group enters the surveillance area and splits. To handle a compound object split, the manager creates a new correspondence matrix between the objects that form the compound object and the image targets detected as split candidates. This time the correspondence is based on color histograms and target height, associating a segmented target with each object of the compound object. The descriptors of the objects are recovered from the compound object and added to the tracked object list, associating with each object the segmented primitives of the target it matched. The compound object descriptor is removed from the tracked object list and discarded. For the case of a single split, a new object is created and added to the new-born object list. This new object is definitely moved to the tracked object list after being tracked for 5 consecutive frames. The Merge manager- When a merge situation is detected, a compound object descriptor is created and added to the list of object trackers, moving
the object descriptors of the merged objects from the tracked object list to a dying object list, decreasing their life vitality over a period of 10 frames, after which they are definitely discarded. The new object descriptor includes the color histograms and the 3D heights of the merged objects (compound object descriptors) and also the number of targets merged. If a split situation is detected before the death of the objects (e.g., objects crossing), the object descriptors are recovered from the dying object list back to the tracked list. The New/Lost manager- A null value detected in the last row of the OR matrix means that a new object was detected. A single object descriptor is created and included in a list of new-born objects, increasing its life vitality over a period of 5 frames. After this period, the descriptor is moved to the tracked object list. If a null value is detected in the last column of the OR matrix, an object is considered lost. Its descriptor is moved from the tracked object list to a dying object list, decreasing its life vitality over a period of 10 frames, after which it is definitely discarded. The Update manager- At this stage, the tracked object list has a complete object-target matching, and the object trackers are updated with the segmented target information. The feedback process supplies to each sensor node information about where and how many targets should be detected at the next time instant, taking observations from the cameras that are fused at the ground-plane tracking level. This information is useful to cross-check the existence of groups and also to validate the cardinality of those groups by counting the number of projected targets that fall inside the bounding area of a detected group. This approach enables more robust image split/merge handling and target grouping.
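A compact sketch of the correspondence matrix and the row/column sums that trigger the managers is given below. The exact normalization of the overlap ratio is not given in the text, so intersection-over-smaller-box and the threshold value are assumptions.

```python
import numpy as np

def overlap_ratio(box_a, box_b):
    """Bounding-box overlap cue; boxes are (x1, y1, x2, y2)."""
    ix = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    iy = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = ix * iy
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    return inter / min(area(box_a), area(box_b))

def correspondence_matrix(pred_boxes, det_boxes, thr=0.2):
    """Binary object/target correspondence matrix plus the sums driving the
    split / merge / new / lost decisions."""
    m = np.array([[overlap_ratio(p, d) > thr for d in det_boxes]
                  for p in pred_boxes], dtype=int)
    targets_per_object = m.sum(axis=1)   # >1 -> split candidate, 0 -> lost object
    objects_per_target = m.sum(axis=0)   # >1 -> merge candidate, 0 -> new target
    return m, targets_per_object, objects_per_target
```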
4 Target Ground-Plane Mapping
Each element of the tracked object list has a state vector and an associated error covariance matrix obtained from the tracker. Each sensor node has an associated homography that maps image points into the ground-plane surveillance area, mapping the tracked target's primitives (feet and head points) into the ground plane through the homography transformation. Considering the uncertainty of the image coordinates and the uncertainty of the homography estimation, which are assumed uncorrelated, the mapping into the ground plane has an associated uncertainty obtained by first-order propagation, where J represents the Jacobian matrices, the image-coordinate covariance is obtained from the object Kalman filter tracker, and the homography error covariance is obtained using the solution proposed in [3]. Since the 3D target height is modelled as in Eq. 5, the uncertainty associated with the 3D target height is given by propagating through the Jacobian matrix of Eq. 5.
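The following sketch illustrates the first-order uncertainty propagation through a planar homography; the homography-covariance term of the full expression is omitted for brevity, so this is a partial, assumed implementation rather than the paper's formula.

```python
import numpy as np

def map_to_ground(H, p_img, cov_img):
    """Map an image point to the ground plane through homography H (3x3)
    and propagate its 2x2 covariance to first order."""
    x, y = p_img
    u = H @ np.array([x, y, 1.0])
    gx, gy = u[0] / u[2], u[1] / u[2]
    # Jacobian of the projective division w.r.t. the image coordinates.
    J = np.array([[H[0, 0] / u[2] - gx * H[2, 0] / u[2],
                   H[0, 1] / u[2] - gx * H[2, 1] / u[2]],
                  [H[1, 0] / u[2] - gy * H[2, 0] / u[2],
                   H[1, 1] / u[2] - gy * H[2, 1] / u[2]]])
    cov_ground = J @ cov_img @ J.T
    return np.array([gx, gy]), cov_ground
```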
5 Ground-Plane Tracking
The ground-plane tracking level has two major purposes: to merge the information mapped onto the ground plane by the sensor nodes, and to perform the ground-plane tracking of the pedestrians detected by the sensor nodes, managing the group/ungroup occurrences. Pedestrians are tracked on the ground plane using a Kalman filter tracker. The state vector contains the pedestrian ground-plane position, velocity and acceleration, as well as the 3D target height. A constant acceleration model was adopted for the pedestrian movement, and the height of the pedestrian was modelled as constant. The dimension of the measurement vector depends on the number of sensor nodes that are able to detect and track the pedestrian; the measurement vector stacks the measurements of these nodes and is related to the state vector via the measurement equation, the dimension of the matrix C varying accordingly. The measurement error covariance matrix is defined using the ground-plane uncertainty propagation described in the previous section. The ground-plane tracked objects are referenced as Pedestrians, and each is represented by a descriptor containing the pedestrian description (ground-plane position, height and color histogram), the tracker parameters and the number of pedestrians in the case of a group, together with a list of pointers to the pedestrian descriptors that form the group.
5.1 Tracking and Group Management
At the ground-plane level the major problem to overcome is group formation. A group is defined when a pedestrian is not visible as a single target in any of the sensor nodes. This definition allows the existence of single and compound groups. A single group is defined when a pedestrian forms a compound object with different pedestrians in each of the sensor nodes. A compound group is defined when more than one pedestrian shares a common compound object in different sensor nodes. In both cases, the system is unable to obtain the ground-plane position of the pedestrian directly from the sensor nodes. Correspondence matching between the pedestrian trackers and the mapped measurements from the sensor nodes is obtained using correspondence matrices. The Mahalanobis distance between the a posteriori estimated pedestrian position and the ground-plane mapped position is used as a matching measurement. This correspondence is cross-checked by matching the image-tracked objects with the projection of the a posteriori estimated positions of the head and feet of the pedestrian into the image sensor nodes (recursive projection). An example of a correspondence matrix for the binocular case is shown in Figure 2. Four managers handle the correspondence and group formation: split, grouping, new/lost and update. The major difference between these managers and the ones used in the single-view tracking lies in the grouping occurrence. A split occurrence is detected
Fig. 2. Ground-plane location of compound pedestrians and correspondence matrices
when a pedestrian matches more than one target. The pedestrian descriptors stored in the compound pedestrian descriptor are recovered and new trackers are created; color information is used to match the new targets with the pedestrian descriptors. After solving the split occurrences, the grouping occurrences are handled. Analyzing the information stored in the last row of the correspondence matrix, several groups can be created, representing the pedestrians' grouping. In the example of Figure 2, the correspondence matrix for camera 1 establishes four groups while the one for camera 2 establishes three groups. Groups of cardinality one allow the recovery of the ground-plane position of the pedestrian directly from the sensor nodes, which means that the trackers of these pedestrians can be updated with the measurements supplied by the sensor nodes. The remaining trackers cannot be updated directly from the sensor nodes. For these cases, a novel solution was implemented to estimate the ground-plane position of the in-group pedestrians. Each compound object maps its target primitives onto the ground plane, defining a straight line on the ground plane. Different groups define different lines, and the estimated position of a pedestrian belonging to these groups is defined as the point that minimizes the Euclidean distance to the lines. Figure 2 shows the outcome of this approach in a simulated situation for the binocular case: pedestrians grouped on the ground plane form a compound group whose location is obtained by the intersection of the lines defined by their compound objects, while the positions of the pedestrians seen as single targets are obtained directly from sensor node 1, or from sensor nodes 1 and 2.
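The "point closest to a set of lines" estimate described above can be written as a small linear least-squares problem. The sketch below assumes each line is given by two ground-plane points obtained from a compound object's mapped primitives; this representation is an assumption about the paper's construction.

```python
import numpy as np

def closest_point_to_lines(lines):
    """Point minimizing the sum of squared distances to 2D lines, each line
    given as (point_a, point_b)."""
    A = np.zeros((2, 2))
    b = np.zeros(2)
    for pa, pb in lines:
        d = np.asarray(pb, float) - np.asarray(pa, float)
        d /= np.linalg.norm(d)
        P = np.eye(2) - np.outer(d, d)      # projector onto the line normal
        A += P
        b += P @ np.asarray(pa, float)
    return np.linalg.solve(A, b)

# Example: two nearly perpendicular lines crossing near (1, 1).
print(closest_point_to_lines([((0, 1), (2, 1.1)), ((1, 0), (1.05, 2))]))
```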
6 Performance Evaluation and Results
The integration of multiple cameras to track multiple targets was analyzed in an indoor environment. Figure 3 shows a few images of the indoor multiple-pedestrian tracking with the ground-plane trajectories recovered for both pedestrians. The green boxes superimposed on the images represent the tracked objects, while the blue ones represent the image projection (recursive trajectory) of the ground-plane tracked pedestrians. The red dots on the top and bottom of the blue boxes
Fig. 3. Tracking two pedestrians with long-term grouping in an indoor environment, with the estimated ground-plane trajectory coordinates.
correspond to the projections of the feet and head of the pedestrian. The lines shown at the bottom represent the occurrence of merge situations at the image level.
7 Conclusions
The integration of several visual sensors for a common surveillance task was presented. A simple and robust solution to handle image occlusion and grouping was proposed. The ground-plane pedestrian grouping and tracking was solved using a very simple solution, obtaining the ground-plane location of a pedestrian or group of pedestrians even in simultaneous camera grouping situations. Experimental results on tracking multiple pedestrians were presented, with excellent performance.
References
1. R. Collins, et al., A System for Video Surveillance and Monitoring, CMU-RI-TR-00-12, Carnegie Mellon University, 2000.
2. Bar-Shalom, Y., Fortmann, T., Tracking and Data Association, Academic Press, New York, 1988.
3. Hartley, R., Zisserman, A., Multiple View Geometry in Computer Vision, Cambridge University Press, 2000.
4. Horprasert, T., Harwood, D., Davis, L., A statistical approach for real-time robust background subtraction and shadow detection, ICCV'99 Frame Rate Workshop, 1999.
5. Haritaoglu, I., Harwood, D., Davis, L., Hydra: Multiple people detection and tracking using silhouettes, IEEE Workshop on Visual Surveillance, 1999.
6. Piater, J., Crowley, J., Multi-Modal Tracking of Interacting Targets Using Gaussian Approximations, PETS 2000, 2000.
7. Swain, M.J., Ballard, D., Color Indexing, IJCV, 7(1):11-32, 1991.
8. McKenna, S., Raja, Y., Gong, S., Tracking Colour Objects Using Adaptive Mixture Models, Image and Vision Computing, 17:225-231, 1999.
9. Black, J., Ellis, T., Multi Camera Image Tracking, IEEE PETS2001, 2001.
10. Cai, Q., Aggarwal, J.K., Automatic Tracking of Human Motion in Indoor Scenes Across Multiple Synchronized Video Streams, ICCV'98, Bombay, 1998.
11. Zhao, T., Nevatia, R., Lv, F., Segmentation and Tracking of Multiple Humans in Complex Situations, IEEE CVPR, Hawaii, 2001.
12. McKenna, S., Jabri, S., Duric, Z., Rosenfeld, A., Tracking Groups of People, CVIU, 80:42-56, 2000.
13. Mittal, A., Video Analysis Under Severe Occlusions, PhD Thesis, University of Maryland, 2002.
14. Yang, D., González-Baños, H., Guibas, L., Counting People in Crowds with a Real-Time Network of Simple Image Sensors, IEEE ICCV'03, 2003.
15. Remagnino, P., Jones, G.A., Automated Registration of Surveillance Data for Multi-Camera Fusion, ISIF, 1190-1197, 2002.
16. Chang, T., Gong, S., Bayesian Modality Fusion for Tracking Multiple People with a Multi-Camera System, in Proc. European Workshop on Advanced Video-based Surveillance Systems, UK, 2001.
17. Khan, S., Javed, O., Rasheed, Z., Shah, M., Human Tracking in Multiple Cameras, IEEE ICCV'01, 331-336, 2001.
18. Kang, J., Cohen, I., Medioni, G., Continuous Tracking Within and Across Camera Streams, IEEE CVPR'03, 267-272, 2003.
19. Kang, J., Cohen, I., Medioni, G., Continuous Multi-Views Tracking Using Tensor Voting, Workshop on Motion and Video Computing (MOTION'02), 181-186, 2002.
20. Stein, G., Tracking from Multiple View Points: Self-calibration of Space and Time, Image Understanding Workshop, Nov. 1998.
21. Brémond, F., Thonnat, M., Tracking Multiple Non-Rigid Objects in Video Sequences, IEEE Transactions on Circuits and Systems for Video Technology, vol. 8, no. 5, 1998.
Application of Radon Transform to Lane Boundaries Tracking

R. Nourine1, M. Elarbi Boudihir2, and S.F. Khelifi3

1 ICEPS Laboratory, Djilalli Liabess University, Sidi Bel Abbes, Algeria
[email protected] & [email protected]
2 Dept. of Computer Science and Information Systems, M. Ibn Saud University, Riyadh, KSA
[email protected] & [email protected]
3 Dept. of Computer Science and Information Systems, King Faisal University, Dammam, KSA
[email protected] & [email protected]
Abstract. This paper describes a low-cost algorithm for tracking lane boundaries in a sequence of images. The algorithm is intended for painted roads with low curvature. The basic idea of our approach is that complete processing of each image can be avoided by using the knowledge of the lane marking positions in the previous images. Marking detection is obtained using the Radon transform, which exploits the brightness of the markings relative to the road surface. Experimental tests prove the robustness of this approach even in the presence of shadows. The originality of our approach compared to those using the Hough transform is that it requires neither a thresholding step nor an edge detection operator. Consequently, the proposed method proves to be much faster.
1 Introduction
In the last two decades, a great deal of research in the domain of transport systems has been conducted to improve safety conditions by the entire or partial automation of driving tasks. Among these tasks, lane recognition plays an important part in every driver assistance system and autonomous vehicle, providing information such as the lane structure and the vehicle position relative to the lane. Thanks to the great deal of information it can deliver, computer vision has become a powerful means of sensing the environment and has been widely used in lane boundary detection and tracking. In many proposed systems, lane detection consists of the localization of specific primitives such as the road markings painted on the surface of the road. This restriction simplifies the detection process; nevertheless, two situations can disturb it: the presence of other vehicles on the same lane, partially occluding the road markings ahead of the vehicle, and the presence of shadows (caused by trees, buildings, or bridges). Two classes of lane detection systems dominate the autonomous guided vehicle field. The first class consists of edge-based systems. It relies on thresholding the image intensity to detect potential lane edges, followed by a perceptual grouping of the
edge points to detect the lane markers of interest. The problem with thresholding the image intensity is that, in many road scenes, it is not possible to extract the true lane edge points without also extracting false edge points. These false edge points can be due to vehicles, puddles, cracks, shadows or other imperfections in the road surface. The second class of systems overcomes this problem by working directly on the image intensity, as opposed to separately detected edge points, and by using a global model of lane shape. Tests of these systems on extremely large and varied data sets show that the second class of lane detection systems performs significantly better than the first class. This paper presents a low-cost vision-based approach capable of reaching real-time performance in the detection and tracking of structured road boundaries (with painted lane markings), which is robust enough in the presence of shadows. Given a sequence of images acquired with a single camera mounted on the vehicle, we wish to automatically detect and track the road boundaries. The proposed approach requires some hypotheses about the road structure. The vehicle is supposed to move on a flat road that is straight or has low curvature. Hence, the lane boundaries are assumed locally parallel, and the lane markings can be described by two parallel straight lines in front of the vehicle. Generally, even in the presence of shadows, the lane markings and the asphalt of the road are sufficiently contrasted. Using this characteristic, we propose an approach based on the Radon transform applied directly to the image gray levels to extract a segment-based model describing the current lane. Further, we use the temporal correlation between successive images to reduce the processing cost and to optimize the lane boundary tracking process. The paper is organized as follows. In Section 2, a review of some lane boundary detection techniques is presented. Section 3 introduces the use of the Radon transform for straight lane boundary detection. The proposed approach is presented in Section 4, while the experimental results are illustrated in Section 5.
2 Related Work
At present many different vision-based lane detection algorithms have been developed. Several representative lane detection systems are reviewed in this section. The GOLD system developed by Broggi et al. uses an edge-based lane boundary detection algorithm [1]. The acquired image is remapped into a new image representing a bird's eye view of the road, where the lane markings are nearly vertical bright lines on a darker background. Hence, a specific adaptive filtering is used to extract quasi-vertical bright lines that are then concatenated into larger segments. Kreucher et al. propose in [2] the LOIS algorithm as a deformable template approach. A parametric family of shapes describes the set of all possible ways that the lane edges could appear in the image. A function is defined whose value is proportional to how well a particular set of lane shape parameters matches the pixel data in a specified image. Lane detection is performed by finding the lane shape that maximizes the function for the current image. In the lane following process, LOIS uses information from the previous frame.
Carnegie Mellon University proposed the RALPH system, used to control the lateral position of an autonomous vehicle [4]. It uses a matching technique that adaptively adjusts and aligns a template to the averaged scanline intensity profile in order to determine the lane's curvature and lateral offset. The same university developed another system called AURORA, which tracks the lane markers present on a structured road using a color camera mounted on the side of a car and pointed downwards toward the road [5]. A single scan line is applied in each image to detect the lane markers. An algorithm intended for painted or unpainted roads is described in [6]. Some color cues are used to conduct image segmentation and remove the shadows. Assuming that lanes are normally long with smooth curves, their boundaries can be detected using the Hough transform applied to the edge image. A temporal correlation assumed between successive images is used in the following phase. Another lane detection and following approach based on the Hough transform is proposed in [7]. It is designed to extract the road position during motorway driving scenarios. Some constraints are assumed on the road contours in order to reduce the Hough space search. Moreover, temporal constraints are assumed between successive images. Since the Hough transform works on binary-valued images, it was necessary to combine a thresholding step and an edge detection operator. A review of the most advanced approaches to the (partial) automation of the road following task, using on-board systems based on artificial vision, is presented in [11].
3 Radon Transform for Straight Lane Boundary Detection
Generally, vision-based road detection and tracking systems use a model in order to perform reliable recognition. The use of models simplifies the detection process by limiting the search area to specific image zones and to restricted intervals of the model parameters. However, many model-based systems place some constraints on the environment in order to have a unique solution in the boundary detection. In our case the constraints concern the road structure and are as follows:
- The vehicle is moving on a flat road that is straight or has low curvature.
- The lane boundaries are assumed locally parallel.
- The lane boundaries are continuous in the image plane, which implies their continuity in the physical world. This constraint makes the prediction of a missing boundary possible (when the boundary detection technique fails).
Based on these constraints, the portion of the current lane in front of the vehicle can be described by two parallel straight lines. The perspective projections of these parallel lines onto the image plane are not parallel and converge to a vanishing point. The current lane can therefore be described by a segment-based model in which the boundaries are approximated by two straight lines (Eq. 1). Among the many approaches suggested in the literature to extract lines in an image, the Hough transform and the related Radon transform have received much attention. These two techniques transform a two-dimensional image with lines
into a domain of possible line parameters, where each line gives a peak positioned at the corresponding line parameters. Many forms of the Hough transform have been developed [9][10], which generally require a preparatory thresholding or filtering step applied to the gray-level image. The Radon transform does not have this inconvenience, since it can work directly on the gray-level image. Let us consider a line expressed by Eq. 2, parameterized by its angle and its smallest distance to the origin of the coordinate system. As shown in Eq. 3, the Radon transform for such a parameter pair is the line integral through the image g(x,y), where the Dirac delta function (infinite for argument 0 and zero for all other arguments) selects the points on the line [3].
The Radon transform contains a peak corresponding to every line in the image that is brighter than its surroundings, and a trough for every dark line. Thus, the problem of detecting lines is reduced to detecting these peaks and troughs in the transform domain. The Radon transform is particularly suited for finding lines in images characterized by a large amount of noise.
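For reference, the standard line parameterization and Radon transform that Eqs. (2) and (3) presumably take (the symbols below are my own reconstruction from the surrounding text, not the authors' notation):

```latex
\begin{align}
  \rho &= x\cos\theta + y\sin\theta, \tag{2}\\
  \check{g}(\rho,\theta) &= \int_{-\infty}^{\infty}\!\!\int_{-\infty}^{\infty}
      g(x,y)\,\delta(\rho - x\cos\theta - y\sin\theta)\,dx\,dy. \tag{3}
\end{align}
```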
4 Lane Detection and Tracking Approach
The approach for lane boundary detection and tracking proposed in this work takes place in two phases. At the beginning, the vision system executes the initial phase, which analyses the first acquired image and allows the vision system to initialize the following phase. Next, the following phase is performed on the subsequent images in order to keep track of the lane boundaries. The algorithm of the proposed lane detection and following approach is described in Fig. 1.
4.1 Lane Boundaries Detection
As mentioned in Section 3, the current lane is supposed to be linear in front of the vehicle, delimited by two straight lines. Thus, we propose to use the Radon transform to extract their respective parameters in the image plane, as described by Eq. 1. In the initial phase the vehicle is assumed to be centered inside the lane and oriented along its axis. Exploiting the knowledge of the acquisition parameters, the vision system can predict the orientation of the lane boundaries in the image plane. Moreover, the lane markings are supposed to be visible on the road in front of the vehicle. These assumptions mean that the lane boundaries are easily detectable in two distinct search windows. Let us consider Fig. 2 as the first acquired image. We present below the algorithm for the left lane boundary detection; a similar algorithm is used for the right boundary.
Fig. 1. Proposed algorithm for lane boundaries detection and tracking
Fig. 2. Typical image in initial phase
First, the vision system estimates the search domain of the orientation. No explicit constraints are placed on the distance values; nevertheless, their search domain depends on the size of the image plane. Next, the vision system computes the Radon transform on the search window for every candidate parameter pair. The Radon
transform is shown in Fig. 3a, where the largest values around the peak correspond to lines that cross the road markings. We consider that the optimal solution for the left boundary is the line tangent to the limit of the lane marks, as shown in Fig. 2. To extract this solution we introduce a performance measure calculated for each parameter pair. This function measures the difference between the Radon transforms computed on two straight lines positioned on each side of the candidate line, as indicated by Eq. 4. Figure 3b presents all values of this function obtained from the search window. The shift is chosen such that, for the optimal solution, one offset line passes through the white lane marks and the other passes through the dark road surface, as illustrated by Fig. 2. Thus, the function should be maximal for the optimal solution, as indicated by Eq. 6 and illustrated by Fig. 3b.
Fig. 3. (a) Radon transform applied on the search window; (b) performance measure
A similar process is applied to the right lane boundary to extract its optimal parameters. The resulting parameter vector groups the lane boundary parameters extracted in the initial phase (time 0).
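A minimal sketch of this boundary search is given below: the Radon value is approximated by sampling the image along a candidate line, and the contrast measure compares the two offset lines on either side of the candidate. The offset `delta`, the sampling scheme and the sign convention are assumptions about Eqs. (4)-(6).

```python
import numpy as np

def line_integral(img, theta_deg, rho, n_samples=200):
    """Approximate R(rho, theta): sum of gray levels along the line
    x*cos(theta) + y*sin(theta) = rho inside the search window `img`."""
    h, w = img.shape
    t = np.deg2rad(theta_deg)
    n = np.array([np.cos(t), np.sin(t)])      # line normal
    d = np.array([-np.sin(t), np.cos(t)])     # line direction
    s = np.linspace(-max(h, w), max(h, w), n_samples)
    pts = rho * n[None, :] + s[:, None] * d[None, :]
    xs, ys = pts[:, 0].round().astype(int), pts[:, 1].round().astype(int)
    ok = (xs >= 0) & (xs < w) & (ys >= 0) & (ys < h)
    return img[ys[ok], xs[ok]].sum()

def best_boundary(img, thetas, rhos, delta=3.0):
    """Scan (theta, rho) and maximize the contrast measure
    F = R(rho - delta) - R(rho + delta)."""
    best, best_f = None, -np.inf
    for th in thetas:
        for r in rhos:
            f = line_integral(img, th, r - delta) - line_integral(img, th, r + delta)
            if f > best_f:
                best, best_f = (th, r), f
    return best, best_f
```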
4.2 Lane Boundaries Tracking
Every driver assistance system must ensure real-time processing. To achieve this goal, the image processing speed must be fast. The basic idea of our approach is that complete processing of each image can be avoided by using the knowledge of the position of the road boundaries in the previous images. This strategy can be used by assuming a high temporal correlation between subsequent images. Let us consider an image at time t-1 (Fig. 4a), with already detected lane boundary parameters. Taking this previous result into account, as indicated in Fig. 1, the vision system predicts the corresponding search areas for the left and right boundaries in the next image at time t. In the same way, the search domains of the parameters can be predicted (Fig. 4b), both for the left boundary and for the right
boundary. The shifts were set empirically. The lane boundaries extracted in image t are shown in Fig. 4c.
Fig. 4. Following process: (a) Lane boundaries in image t-1, (b) specific search windows in image t and search domains of boundaries parameters, (c): detected lane boundaries in image t
In any road detection system some mechanism is required for dealing with occlusion of the road boundaries due to other vehicles, lack of well-defined lane markings, etc. In the proposed approach, we treat only the lack of well-defined lane markings, since at this stage we consider roads without obstacles. Based on the assumptions made on the road structure (see Section 3), we consider that the lane width is fixed. This width is initially known, or estimated from the first image. The suggested idea is to calculate the lane width for each analyzed image. If the extracted lane boundaries give a width close to that given initially, then the process continues normally. Otherwise, the system reuses the results from the preceding image (at time t-1) to predict the lane boundary positions and orientations in the next image (at time t+1). After n successive failures the system stops. This strategy makes the boundary detection more robust and the lane following process rather fast. Thus, it becomes possible to conceive a vision-based lateral position estimation system able to alert the driver in real time when the vehicle begins to stray out of its lane.
5 Experimental Results
The image processing in the sequence can be carried out in 0.17 seconds per image. It is important to note that the detection processes of the two lane boundaries are independent; moreover, the Radon transforms calculated for the various parameter pairs are also independent. Thus, parallel programming would reduce the computing time in an obvious way. The proposed approach has been tested in different conditions: without obstacles, on straight and curved roads. Figure 5 presents a sequence of a straight road without shadows, where the robustness of the approach is very strong. This robustness does not fade in the presence of shadows, even for a curved road, as shown in Figure 6. The most critical situation met in certain sequences is the lack of lane markings in the image search zones, as shown in Figure 7. The first series (Fig. 7a) presents the final images when the lane width is not taken into account to validate the results. The left boundary detection is false for the last two images, as a consequence of the very large spacing between the marks. The second series shows the results obtained by exploiting the lane width to solve the problem of missing markings.
Fig. 5. Sequence of straight road
Fig. 6. Three sequences of straight and curved roads with presence of shadows
Fig. 7. Critical situation: (a) example of false detection, (b) detection using previous result
6 Conclusion
In this paper, we presented a low-cost lane detection algorithm based on video sequences taken from a monocular vision system mounted on a vehicle. A real-time lane detection and tracking system is an essential module of an object detection and recognition system for crash warning and avoidance. The basic idea is that the complete processing of each image can be avoided by using the knowledge of the position of the lane boundaries in the previous images. The vehicle is supposed to move on a flat road that is straight or has low curvature; hence, we consider the lane boundaries as straight lines. The proposed lane detection and tracking can be applied only to painted roads. The lane boundaries were detected using the Radon transform. The proposed approach has been tested off-line on video data, and the experimental results have demonstrated a fast and robust system. Parallel programming would obviously reduce the computing time further.
References
1. Bertozzi M., Broggi A., "GOLD: A parallel real-time stereo vision system for generic obstacle and lane detection", IEEE Transactions on Image Processing 7 (1), 1998, p. 62-81.
2. Kreucher C., Lakshmanan S., Kluge K., "A Driver Warning System Based on the LOIS Lane Detection Algorithm", Proceedings of the IEEE International Conference on Intelligent Vehicles, Stuttgart, Germany, 1998, p. 17-22.
3. Toft P., "The Radon Transform - Theory and Implementation", Ph.D. thesis, Department of Mathematical Modeling, Technical University of Denmark, June 1996.
4. Pomerleau D., Jochem T., "Rapidly Adapting Machine Vision for Automated Vehicle Steering", IEEE Expert, 1996, vol. 11, no. 2, p. 19-27.
5. Chen M., Jochem T., Pomerleau D., "AURORA: A Vision-Based Roadway Departure Warning System", in Proceedings of the IEEE Conference on Intelligent Robots and Systems, 1995, vol. 1, p. 243-248.
6. Ran B., Xianghong Liu H., "Development of a Vision-Based Real-Time Lane Detection and Tracking System for Intelligent Vehicles", presented at the Transportation Research Board Annual Meeting, preprint CD-ROM, Washington DC, 2000.
7. McDonald J. B., "Application of the Hough Transform to Lane Detection in Motorway Driving Scenarios", in Proceedings of the Irish Signals and Systems Conference, June 25-27, 2001, R. Shorten, T. Ward, T. Lysaght (Eds.).
8. Bertozzi M., Broggi A., Cellario M., Fascioli A., Lombardi P., Porta M., "Artificial Vision in Road Vehicles", Proceedings of the IEEE - Special Issue on Technology and Tools for Visual Perception, 90(7):1258-1271, July 2002.
9. Fung P., Lee W., King I., "Randomized Generalized Hough Transform for 2-D Grayscale Object Detection", in Proceedings of ICPR'96, August 25-30, 1996, Vienna, Austria, p. 511-515.
10. Hansen K., Andersen J. D., "Understanding the Hough transform: Hough cell support and its utilization", Image and Vision Computing (15), March 1997, p. 205-218.
11. Bertozzi M., Broggi A., Cellario M., Fascioli A., Lombardi P., Porta M., "Artificial Vision in Road Vehicles", Proceedings of the IEEE - Special Issue on Technology and Tools for Visual Perception, 90(7):1258-1271, July 2002.
A Speaker Tracking Algorithm Based on Audio and Visual Information Fusion Using Particle Filter

Xin Li1, Luo Sun1, Linmi Tao1, Guangyou Xu1, and Ying Jia2

1 Key Laboratory of Pervasive Computing, Ministry of Education, Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China
{x-li02, sunluo00}@mails.tsinghua.edu.cn, {linmi, xgy-dcs}@tsinghua.edu.cn
2 Intel China Research Center, Raycom Infotech Park A, Beijing 100080, China
[email protected]
Abstract. Object tracking by sensor fusion has become an active research area in recent years, but how to fuse various sources of information in an efficient and robust way is still an open problem. This paper presents a new algorithm for tracking a speaker based on audio and visual information fusion using a particle filter. A closed-loop architecture with a reliability measure for each individual tracker is adopted, and a new method for data fusion and reliability adjustment is proposed. Experiments show the new algorithm is efficient in fusing information and robust to noise.
1 Introduction
Intelligent environments such as distributed meetings and smart classrooms have gained significant attention during the past few years [1][2]. One of the key technologies in these systems is a reliable speaker tracking module, since the speaker often needs to be emphasized. There exist tracking methods based both on audio information (sound source localization, SSL) [3] and on visual information [11]. As methods using only audio or only visual tracking are not robust, researchers are paying more and more attention to the fusion of audio and visual information. In general, there are two paradigms for audio and visual fusion: bottom-up and top-down. Both paradigms have a fuser (a module to fuse information) and multiple sensors. The bottom-up paradigm starts from the sensors. Each sensor uses a tracker to estimate the unknown object state (e.g. object location and orientation), i.e. to solve the inverse problem based on the sensory data. Once individual tracking results are available, distributed sensor networks [4] or graphical models [5] are used to fuse them together to generate a more accurate and robust result. To make the inverse problem tractable, assumptions are typically made in the trackers and the fuser, e.g., system linearity and Gaussianity are assumed in the Kalman
tracker [6] and the fuser [4]. These simplified assumptions inherently hinder the robustness of the tracking system. The top-down paradigm, on the other hand, emphasizes the fuser. It has an intelligent fuser but rather simple sensors [7][8]. It tries to achieve tracking by solving the forward problem. First, the fuser generates a set of hypotheses (also called particles; we use the two words interchangeably in this paper) to explore the possible state space. Sensory data are then used to compute the likelihood/weight of each hypothesis. These weighted hypotheses are then used by the fuser to estimate the distribution of the object state. As it is usually much easier to verify a given hypothesis than to solve the inverse tracking problem (as in the bottom-up paradigm), more complex and accurate models can be used in the top-down paradigm. This in turn results in more robust tracking. However, because the sensors use verifiers instead of trackers, they do not help the fuser generate good hypotheses. The hypotheses are semi-blindly generated from the motion prediction [7]. So, when the possible state space is large, a great number of particles is needed, which results in a heavy computational cost. Recently, Chen and Rui proposed a new fusion framework that integrates the two paradigms in a principled way [10]. It uses a closed-loop architecture where the fuser and multiple sensors interact to exchange information by evaluating the reliability of the various trackers. However, due to the different characteristics of the visual and audio trackers, this method occasionally suppresses the information provided by the audio tracker, and is thus not robust under some conditions. In this paper, we propose a new fusion and tracker-reliability adjustment method. Based on a closed-loop architecture, the new method emphasizes the audio information by making the visual tracker and the audio tracker more symmetric. The tracking system then becomes more efficient in information fusion and more robust to many kinds of noise. The rest of the paper is organized as follows. In Section 2 we discuss our system framework and individual trackers. Section 3 describes our method of fusion and adjustment in detail. Section 4 gives some of the experimental results. Section 5 draws a brief conclusion.
2 System Framework and Individual Trackers
Our system uses an architecture similar to that in [10]. The individual trackers are first used to estimate the target's position, then this information is sent to a fuser to get a more accurate and robust tracking result. The fuser then computes and adjusts the reliability of each individual tracker. The whole process can be summarized as follows:
1. Estimate the target's position with the individual trackers.
2. Fuse the information to get the final result.
3. Adjust the reliability of the individual trackers.
4. Go to 1 and process the data of the next frame.
We use a vision-based color tracker and an audio-based SSL tracker as individual trackers. The color tracker is used to track the speaker's head, which is
modeled as an ellipse and initialized by a face detector. The SSL tracker is used to locate the speaker.
2.1 Audio Tracker (SSL Tracker)
Audio SSL is used to locate the position of the speaker. In our particular application of a smart classroom, the system cares most about the horizontal location of the speaker. Suppose there are two microphones, A and B. Each received signal is a delayed version of the speaker's source signal, filtered by the room reverberation and corrupted by additive noise, where D denotes the time delay between the two microphones. Assuming the signal and noise are uncorrelated, D can be estimated by finding the maximum of the cross-correlation between the two received signals [3]. Let the middle point between the two microphones be position O and let the source be at location S. The bearing angle of the source can then be estimated from the delay D, the microphone spacing and the speed of sound, as shown in [3]. This process can be generalized to a microphone array. For simplicity, we placed the camera and the microphone array so that their center points coincide in the horizontal plane. Given the camera parameters (focal length, horizontal resolution and horizontal middle point in the pixel coordinate system), the angle estimated by the microphone array can be converted into the camera's horizontal pixel coordinate. A reliability factor for SSL is also calculated based on the steadiness of the SSL results: we take the sound source locations in n consecutive frames and compute the maximum difference between each consecutive pair. The reliability factor is then obtained from a Gaussian model of this maximum difference, with a parameter indicating the tolerance of the sound source position difference.
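The pipeline just described can be sketched as follows. The plain cross-correlation stands in for the cited time-delay estimator of [3], and the tolerance value sigma is an assumption.

```python
import numpy as np

SOUND_SPEED = 343.0  # m/s

def time_delay(x1, x2, fs):
    """Delay between two microphone signals from the peak of their
    cross-correlation (GCC weighting omitted for simplicity)."""
    corr = np.correlate(x1 - x1.mean(), x2 - x2.mean(), mode='full')
    lag = np.argmax(corr) - (len(x2) - 1)
    return lag / fs

def azimuth_from_delay(delay, mic_distance):
    """Far-field bearing estimate: sin(phi) = c * D / d."""
    s = np.clip(SOUND_SPEED * delay / mic_distance, -1.0, 1.0)
    return np.arcsin(s)

def angle_to_pixel(phi, focal_px, u0):
    """Convert the horizontal bearing to the camera's pixel coordinate,
    assuming coincident camera / array centres as in the paper."""
    return u0 + focal_px * np.tan(phi)

def ssl_reliability(recent_u, sigma=30.0):
    """Gaussian reliability from the largest jump between consecutive SSL
    estimates (sigma is an assumed tolerance, in pixels)."""
    dmax = np.max(np.abs(np.diff(recent_u)))
    return np.exp(-dmax**2 / (2.0 * sigma**2))
```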
2.2 Visual Tracker (Color Tracker)
We used a kernel-based color tracker[11] to estimate the target’s position in a new frame. We assume that the color histogram of the target is stable. To
track the object in the current frame, the previous frame's state is used as an initial guess. The following steps are used to find the new object state:
1. Initialize the iteration counter.
2. Initialize the location of the target in the current frame with the previous estimate, compute the color histogram at this location, and evaluate the similarity between the candidate and the target by computing the Bhattacharyya coefficient [11].
3. Derive the weights containing the gradient information [11].
4. Find the next location of the target candidate using the gradient information [11].
5. Compute the color histogram at the new position and the similarity between this candidate and the target using the Bhattacharyya coefficient.
6. If the similarity has not decreased, go to step 7; otherwise, move the candidate location back toward the previous estimate and go to step 5.
7. If the displacement between the two locations is small enough, stop; otherwise, take the new location as the starting point and go to step 2.
3 Data Fusion and Reliability Adjustment
We first briefly review the Importance Particle Filter (ICondensation) algorithm [9], which is used in our system and then we describe our data fusion method.
3.1 Generic Particle Filter
In the Condensation algorithm the posterior distribution of the target's position is approximated by a set of properly weighted discrete particles; as the number of particles grows, this approximation gets closer and closer to the actual posterior. The target's position can then be estimated by taking the expectation of the posterior. The ICondensation algorithm draws particles from an importance function in order to concentrate the particles in the most likely part of the state space, and corrects the weights by the ratio between the posterior and the importance function. A recursive calculation of the weights can be obtained [12]:
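The equations referred to here are garbled in the source; a standard reconstruction of the particle approximation, importance weighting and weight recursion is shown below (notation is my own, not the authors').

```latex
\begin{align}
  p(x_t \mid y_{1:t}) &\approx \sum_{i=1}^{N} w_t^{(i)}\,
      \delta\!\left(x_t - x_t^{(i)}\right), \tag{5}\\
  w_t^{(i)} &\propto \frac{p(x_t^{(i)} \mid y_{1:t})}
      {q(x_t^{(i)} \mid y_{1:t})}, \tag{6}\\
  w_t^{(i)} &\propto w_{t-1}^{(i)}\,
      \frac{p(y_t \mid x_t^{(i)})\, p(x_t^{(i)} \mid x_{t-1}^{(i)})}
           {q(x_t^{(i)} \mid x_{t-1}^{(i)}, y_t)}. \tag{7}
\end{align}
```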
Then, the importance particle filtering process can be summarized in three steps:
1. Sampling: N particles are sampled from the proposal function.
2. Measurement: Compute the particle weights by Eq. (7).
3. Output: Decide the object state according to the posterior distribution.
In the ICondensation algorithm, the importance function is crucial. Poor proposals (far from the true posterior) generate particles with negligible weights, which are thus wasted, while particles generated from good proposals (similar to the true posterior) are highly effective. Choosing the right proposal distribution is therefore of great importance.
3.2 Fusion and Adjustment
As in [10], we use the individual trackers discussed above to generate hypotheses and verifiers (observation models) to calculate the weights. However, in [10], the contour tracker and color tracker both use the tracking result of the previous frame as an initialization, so the reliability of these trackers is usually high, as their proposals are not far from the posterior. The SSL tracker, on the other hand, does not depend on previous tracking results and its result is not always accurate. So its reliability may sometimes become very low while it in fact provides valuable information about the speaker. We found in an experiment that when a person passes in front of the speaker while the SSL result is slightly off the speaker due to some inaccuracies, tracking is lost for a while and this decreases the reliability of the SSL tracker. The audio information is then suppressed and few particles are drawn from it, which keeps the tracking result lost; in turn, the reliability of the SSL tracker continues to drop and the lost track may never recover. To overcome this defect, we developed a new fusion and reliability adjustment method which emphasizes the audio information by making the visual and audio trackers more symmetric. Rather than proposing a joint prior based on the audio and visual trackers as in [10], we treat the visual tracker (color tracker) and the audio tracker (SSL tracker) separately. Each of the two trackers is assigned its own reliability factor, distinct from the SSL steadiness factor defined in Section 2.1. The particle filter then proceeds as follows:
1. Generate prior distributions: We generate two prior distributions:
where N indicates a normal distribution, the means are the estimated object positions obtained by the color tracker and the SSL tracker, and the covariance matrices (in one dimension, the variances) of the two distributions indicate the uncertainty of the two trackers.
2. Generate particles and calculate weights: Particles are drawn from the two prior distributions respectively. We then define the visual and audio observation models to calculate the weights. The visual observation model is defined as the similarity between the color histograms of the candidate and the target [11]:
And the audio observation model is defined as the ratio of the correlation value at this position to the highest correlation value [3]:
Assuming independence between the two observation models, the likelihoods are then calculated as:
The weights of the particles are then calculated using Eq. (7).
3. Decide the final target position: After the previous two steps we obtain two posterior distributions for the target, from which two estimates of the target's position are obtained by taking expectations. The likelihoods of these two estimates are computed with the observation models, and each likelihood is weighted by the reliability of the corresponding (video or audio) tracker. Finally, the larger of the two weighted likelihoods decides whether the visual or the audio estimate is selected as the target's position.
4. Reliability adjustment: In this step we tune the reliability factors. The SSL steadiness factor is already calculated above; here we adjust the reliabilities of the two trackers according to the likelihoods of their estimates obtained earlier, since these indicate how likely each estimated position is to be the true position.
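One possible reading of this fusion cycle is sketched below. The prior widths, the particle split and the normalization used for the reliability update are assumptions (the paper's defining equations are not recoverable from the source); the observation likelihoods are passed in as callables.

```python
import numpy as np

def fuse_step(x_color, x_ssl, r_v, r_a, color_like, audio_like,
              sigma_v=10.0, sigma_a=25.0, n=150):
    """One fusion cycle in the spirit of Sec. 3.2, for a 1-D horizontal state."""
    # 1. Two proposal (prior) distributions, one per tracker.
    parts_v = np.random.normal(x_color, sigma_v, n)
    parts_a = np.random.normal(x_ssl, sigma_a, n)

    def posterior_mean(parts):
        w = np.array([color_like(p) * audio_like(p) for p in parts])
        w /= w.sum() + 1e-12
        est = np.dot(w, parts)
        return est, color_like(est) * audio_like(est)

    # 2-3. Two posterior estimates and their likelihoods.
    xv, Lv = posterior_mean(parts_v)
    xa, La = posterior_mean(parts_a)
    x_final = xv if r_v * Lv >= r_a * La else xa

    # 4. Reliability adjustment from the estimates' likelihoods (assumed form).
    r_v_new = Lv / (Lv + La + 1e-12)
    r_a_new = La / (Lv + La + 1e-12)
    return x_final, r_v_new, r_a_new
```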
4 Experimental Results
Our algorithm has shown the following advantages. First, with a finite (and especially a small) number of particles, it sufficiently exploits the audio information by drawing a comparatively large number of particles from it, thus enhancing the robustness of tracking. Even if the tracker sometimes fails, for example when a person crosses in front of the speaker, the system later recovers from this error by using the audio information, whose weighted likelihood becomes the larger one. Second, our algorithm is also robust to audio noise. When audio noise occurs, the sound source localization becomes unsteady, resulting in a small SSL reliability factor, and this decreases the influence of the noise. We now give some of our experimental results. In all frames the red rectangle represents the tracking result and the green line represents the result of SSL. The color tracker uses a binned quantification of the color space, and 300 particles are used in ICondensation. The system runs on a standard AMD 1.8 GHz CPU while processing the 15 frames/sec video sequence. Figure 1 shows that our fused tracker is more robust than a single vision-based tracker.
Fig. 1. Single vision-based tracker VS. our fused tracker. Upper row (left to right) is tracking by a single vision-based color tracker, lower row (left to right) is tracking by our fused tracker.
The single vision-based tracker (upper 3 frames) loses track, while the fused tracker (lower 3 frames) does not. Figure 2 shows that our new algorithm is more robust than the algorithm used in [10] when both use a color tracker and an SSL tracker. The upper 3 frames show tracking by the joint prior and reliability adjustment method in [10]: tracking is lost and cannot recover because the reliability of the audio tracker decreases rapidly. The lower 3 frames show tracking by our algorithm: tracking recovers after the two persons cross each other.
Fig. 2. Compare with the method in [10]. Upper row (left to right) is tracking by fusion and adjustment method in [10], tracking is lost. Lower row (left to right) is tracking by our fusion and adjustment method, tracking recovers.
Fig. 3. Test against noise (left to right, up to down), including light change (turning on/off lights), background change, persons coming across each other and audio noises.
Figure 3 shows our algorithm is robust to noises. In this sequence our algorithm is tested against light change (turning on/off lights), background change, persons coming across each other, and also the audio noise in the room (computer fans, TV monitors’ noise etc).
5 Conclusion
In this paper, we presented a speaker tracking algorithm based on fusing audio and visual information. Based on a closed-loop architecture, we proposed a new fusion and tracker-reliability adjustment method which better exploits the symmetry between visual and audio information. Individual trackers are first used to track the speaker, then a particle filter is used to fuse the information. Experiments show that with the proposed method the system is efficient in fusing
information and robust to many kinds of noise. In future work, other trackers, such as a contour tracker, can also be included in the algorithm (as another visual tracker, for example) to further enhance the robustness.
References
[1] Ross Cutler, Yong Rui, Anoop Gupta, JJ Cadiz, Ivan Tashev, Li-wei He, Alex Colburn, Zhengyou Zhang, Zicheng Liu, and Steve Silverberg, Distributed meetings: A meeting capture and broadcasting system, in Proc. ACM Conf. on Multimedia, 2002, pp. 123-132.
[2] Yong Rui, Liwei He, Anoop Gupta, and Qiong Liu, Building an intelligent camera management system, in Proc. ACM Conf. on Multimedia, 2001, pp. 2-11.
[3] Yong Rui and Dinei Florencio, Time delay estimation in the presence of correlated noise and reverberation, Technical Report MSR-TR-2003-01, Microsoft Research Redmond, 2003.
[4] K. C. Chang, C. Y. Chong, and Y. Bar-Shalom, Joint probabilistic data association in distributed sensor networks, IEEE Trans. Automat. Contr., vol. 31, no. 10, pp. 889-897, 1986.
[5] J. Sherrah and S. Gong, Continuous global evidence-based Bayesian modality fusion for simultaneous tracking of multiple objects, in Proc. IEEE Int'l Conf. on Computer Vision, 2001, pp. 42-49.
[6] B. Anderson and J. Moore, Optimal Filtering, Englewood Cliffs, NJ: Prentice-Hall, 1979.
[7] J. Vermaak, A. Blake, M. Gangnet, and P. Perez, Sequential Monte Carlo fusion of sound and vision for speaker tracking, in Proc. IEEE Int'l Conf. on Computer Vision, 2001, pp. 741-746.
[8] G. Loy, L. Fletcher, N. Apostoloff, and A. Zelinsky, An adaptive fusion architecture for target tracking, in Proc. Int'l Conf. Automatic Face and Gesture Recognition, 2002, pp. 261-266.
[9] M. Isard and A. Blake, ICONDENSATION: Unifying low-level and high-level tracking in a stochastic framework, in Proc. European Conf. on Computer Vision, 1998, pp. 767-781.
[10] Y. Chen and Y. Rui, Speaker Detection Using Particle Filter Sensor Fusion, in Asian Conf. on Computer Vision, 2004.
[11] D. Comaniciu and P. Meer, Kernel-Based Object Tracking, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 25, no. 5, May 2003.
[12] R. van der Merwe, A. Doucet, N. de Freitas, and E. Wan, The unscented particle filter, Technical Report CUED/F-INFENG/TR 380, Cambridge University Engineering Department, 2000.
Kernel-Bandwidth Adaptation for Tracking Object Changing in Size
Ning-Song Peng1,2, Jie Yang1, and Jia-Xin Chen2
1
Institute of Image Processing and Pattern Recognition, Shanghai Jiaotong University, P.O. Box 104, 1954 Huashan Road, Shanghai, 200030, China
2
Institute of Electronic and Information, Henan University of Science and Technology, Luoyang, 471039, China
{pengningsong, jieyang}@sjtu.edu.cn
Abstract. When tracking an object that changes in size, the traditional mean-shift based algorithm often leads to poor localization owing to its fixed kernel bandwidth. To overcome this limitation, a novel kernel-bandwidth adaptation method is proposed in which an object affine model is employed to describe the scaling problem. By registering the object centroid in consecutive frames through backward tracking, the scaling magnitude in the affine model can be estimated more accurately. The kernel bandwidth is then updated with respect to the scaling magnitude so as to keep up with variations in object size. We have applied the proposed method to track vehicles changing in size, with encouraging results.
1 Introduction
The mean-shift algorithm [1] is an efficient and nonparametric method for nearest mode seeking based on kernel density estimation (KDE) [2]. It has recently been introduced for tracking applications [3,4,5,6]. The mean-shift tracking algorithm is a simple and fast adaptive tracking procedure that finds the maximum of the Bhattacharyya coefficient given a target model and a starting region. Based on the mean-shift vector, which is an estimate of the gradient of the Bhattacharyya function, the new object location is calculated. This step is repeated until the location no longer changes significantly. Since the classic mean-shift iteration itself has no integrated scale adaptation, the Bhattacharyya coefficient can suffer from object scale changes and thus mislead the similarity measurement. In [4], moments of the sample weight image are used to compute the blob scale; however, the computational complexity of this method is too high to meet real-time requirements. In [5], the object scale is detected by evaluating three different kernel sizes (the same size and a ±5% change) and choosing the one that makes the candidate most similar to the object model. (This work was supported by the National Natural Science Foundation of China under Grant No. 301702741 and partially supported by the National Grand Fundamental Research 973 Program of China under Grant No. 2003CB716104.) In tracking an object
expanding its scale, this method cannot keep the tracking window inflated, because any location of a tracking window that is too small will yield a similar value of the Bhattacharyya coefficient. A mechanism for choosing the kernel scale is introduced in [6] based on Lindeberg's scale-space theory, where the author uses an additional scale kernel to perform mean-shift iterations in a scale space built around the initial scale. Because the author uses the Epanechnikov kernel, the mean-shift iterations for finding the best scale amount to averaging all the scales in the scale space, provided the spatial position has already been well matched. This method is therefore similar to the method in [5]. In this paper, we propose a kernel-bandwidth adaptation method under the assumption that the motion of the rigid object satisfies an affine model in consecutive frames. First, using our backward tracking method, we compensate the deviation caused by the mean-shift iterations in the current frame and register the object centroid in consecutive frames. Based on the centroid registration, the normalized pixel locations in the tracking windows of consecutive frames are placed in one coordinate system whose origin is the object centroid, where corner correspondences can easily be obtained so as to estimate the scaling magnitude of the object affine model more accurately. Finally, the kernel bandwidth is updated by the scaling magnitude for tracking in the next frame.
2 Proposed Method
We first review mean-shift theory and its application to object tracking. In Section 2.2, an efficient local-matching method used to find object corner correspondences is proposed. Its ability to reduce the number of mismatched pairs is based on object centroid registration (see Section 2.3). With accurate correspondences, the scaling magnitude of the object affine model can be estimated more accurately by a regression method. Finally, we update the kernel bandwidth according to the scaling magnitude for tracking in the next frame. The outline of the whole system is shown in Fig. 1, where the details of each unit can be found in the corresponding sections.
2.1 Mean-Shift Based Object Tracking
Let the data be a finite set A embedded in the n-dimensional Euclidean space X. The mean-shift vector at a point $x \in X$ is defined as
$$m(x) = \frac{\sum_{a \in A} K(a - x)\, w(a)\, a}{\sum_{a \in A} K(a - x)\, w(a)} - x \qquad (1)$$
where K is a kernel function and $w(a)$ is the weight of sample a. The local mean-shift vector computed at position x is aligned with the gradient of the convolution surface
$$J(x) = \sum_{a \in A} G(a - x)\, w(a) \qquad (2)$$
where G is the shadow kernel of K [1]. By using a kernel histogram as the object model, J(x) is designed to measure the Bhattacharyya coefficient between the given model and the object candidate, and the tracking problem then converts to mean-shift iterations [3,4,5,6]. Let $\{x_i\}_{i=1,\dots,N}$ be the normalized pixel locations of the object image, centered at Y. Considering a gray-level image, the kernel histogram is defined as
$$\hat q_u(Y) = C \sum_{i=1}^{N} k\!\left(\left\| \frac{Y - x_i}{h} \right\|^2\right) \delta\!\left[ b(x_i) - u \right] \qquad (3)$$
where u is the bin index, m is the number of bins, $b(x_i)$ is the quantized pixel intensity at $x_i$, and $\delta$ is the Kronecker delta. Imposing the condition $\sum_{u=1}^{m} \hat q_u = 1$ derives the constant C. The parameter h is called the kernel bandwidth; it normalizes the image coordinates so that the radius of the kernel profile k is one [5]. It is also the size of the tracking window covering the object we want to track.
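As a concrete illustration of Eq. (3), the sketch below builds such a kernel-weighted gray-level histogram with an Epanechnikov profile; the function name and the bin count are our own choices and not taken from the paper.

```python
import numpy as np

def kernel_histogram(patch, h, m=16):
    """Kernel-weighted gray-level histogram of a square patch (sketch of Eq. (3)).

    patch: 2-D array of gray values in [0, 255], assumed centered on the target.
    h: kernel bandwidth (half-size of the tracking window, in pixels).
    m: number of histogram bins.
    """
    rows, cols = patch.shape
    cy, cx = (rows - 1) / 2.0, (cols - 1) / 2.0
    y, x = np.mgrid[0:rows, 0:cols]
    r2 = ((y - cy) ** 2 + (x - cx) ** 2) / float(h * h)   # squared normalized radius
    k = np.maximum(0.0, 1.0 - r2)                          # Epanechnikov profile k(r^2)
    bins = np.minimum((patch.astype(float) / 256.0 * m).astype(int), m - 1)
    hist = np.bincount(bins.ravel(), weights=k.ravel(), minlength=m)
    return hist / hist.sum()                               # normalization constant C
```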
Fig. 1. Outline of the whole system
2.2 Affine Model and Corner Correspondences
We consider only the two kinds of motion encountered most frequently in the real world: translation and scaling. Therefore, the object affine model is given by
$$X_{i+1} = s\, X_i + T \qquad (4)$$
where $X_i$ and $X_{i+1}$ are the locations of the same object feature point in frames i and i+1 respectively, T specifies the translation and s the scaling magnitude. Using s, the kernel bandwidth can be updated by
$$h_{i+1} = s \cdot h_i \qquad (5)$$
In this paper, object corner correspondences are used to estimate the parameters of the affine model. Suppose N corners are extracted from the tracking window $W_i$ in frame i, and M corners from the tracking window $W_{i+1}$ in frame i+1. Moreover, assume that $W_i$ and $W_{i+1}$ have the same size and that their centers indicate the object centroid in the respective frames, i.e. the object centroid is registered. Given a corner point in $W_i$, its corresponding point in $W_{i+1}$ is chosen, according to Eq. (6), as the candidate corner with the minimum pixel-intensity difference,
where I denotes pixel intensity and n (n < M) is the number of candidate corners lying within a given search window centered at the same location in $W_{i+1}$. For each corner in $W_i$ we use Eq. (6) to find its correspondence in $W_{i+1}$. Our local matching method can thus be implemented with computational cost O(Nn), compared to the maximum-likelihood template matching method, whose computational complexity is O(NM) [7,8].
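To make the local matching step concrete, the sketch below pairs each corner of the previous window with the candidate corner (restricted to a small search radius around the same centroid-relative location) whose pixel intensity is most similar. The function and parameter names (match_corners, search_radius) are illustrative, not taken from the paper.

```python
import numpy as np

def match_corners(corners_prev, corners_cur, img_prev, img_cur, search_radius=8):
    """Greedy local corner matching (illustrative sketch).

    corners_prev, corners_cur: (N, 2) and (M, 2) arrays of (row, col) positions,
    expressed in a common coordinate system centered on the object centroid.
    Returns a list of (index_prev, index_cur) pairs.
    """
    pairs = []
    for i, p in enumerate(corners_prev):
        # Candidates are corners of the current window close to the same location.
        d = np.linalg.norm(corners_cur - p, axis=1)
        candidates = np.flatnonzero(d < search_radius)
        if candidates.size == 0:
            continue
        # Pick the candidate whose pixel intensity is closest (cost O(n) per corner).
        ip = float(img_prev[int(p[0]), int(p[1])])
        diffs = [abs(float(img_cur[int(corners_cur[j, 0]), int(corners_cur[j, 1])]) - ip)
                 for j in candidates]
        pairs.append((i, int(candidates[int(np.argmin(diffs))])))
    return pairs
```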
2.3 Object Centroid Registration
An unchanged kernel bandwidth often leads to poor localization when the object expands. Conversely, when the object shrinks, although the tracking window will contain many background pixels as well as foreground object pixels, the center of the tracking window still indicates the object centroid [6]; this gives us a cue for registering the object centroid when the object expands. When the object shrinks, Eq. (6) is used directly to get correspondences, without the registration described below. Assume that at frame i the object, with its centroid at the window center, is well enclosed by an initial tracking window. When the object expands at frame i+1, there is a small offset d between the real centroid of the expanded object and the center of the tracking window in frame i+1; this offset is the matching deviation caused by the mean-shift tracking algorithm. Obviously, only part of the object lies inside the tracking window, because the current object size is bigger than the window. To register the two object centroids,
we first generate a new kernel histogram representing the image inside the current tracking window, i.e. the partial object it encloses; the window center indicates the centroid of this partial object. From frame i+1 back to frame i this partial object shrinks, so it is possible to find its accurate centroid in frame i. Secondly, we seek this partial object backwards in frame i, also by the mean-shift tracking algorithm. There is therefore another offset between the backward-tracked window center and the known object centroid in frame i. Assuming the inter-frame motion is small, this offset can be used to compensate the deviation d, and the object centroid in frame i+1 is finally evaluated by Eq. (7).
Given the object image from the previous frame, we can use Eq. (7) to register its centroid in the current frame before using Eq. (6) to find corner correspondences, which efficiently reduces the number of mismatched pairs. In addition, the translation parameter T of the object affine model is obtained directly from the registered centroid displacement.
Fig. 2. Illustration of the backward tracking for centroid registration
Fig. 2 illustrates the workflow of the registration. The block in the left picture is the initial tracking window. When the object expands its scale in frame i+1 (middle picture), the unchanged kernel bandwidth leads to poor localization; see the thin block. Backward tracking is then performed with a new kernel histogram generated within the thin block, and the corresponding tracking window is shown in the right picture. Finally, using Eq. (7), the object centroid is registered; see the thick block in the middle picture. The image pixels inside the initial tracking window and the thick block in the middle picture are then placed in one coordinate system whose origin is the center of the block, i.e. the object centroid, where Eq. (6) is used to find corner correspondences. In general, before registration we extend the size of the thick block so as to obtain more candidate pairs; see also
the experiments in Section 3. From the process above we obtain correspondences with fewer mismatches, and a traditional regression method is then employed to estimate the final scaling magnitude, which is used to update the kernel bandwidth by Eq. (5).
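As an illustration of this last step, the following sketch estimates a single scaling magnitude s from centroid-registered corner correspondences by least-squares regression and then rescales the bandwidth as in Eq. (5). The helper names estimate_scale and update_bandwidth are ours, not the authors'.

```python
import numpy as np

def estimate_scale(prev_pts, cur_pts):
    """Least-squares estimate of the scaling magnitude s in X_{i+1} = s * X_i.

    prev_pts, cur_pts: (K, 2) arrays of matched corner positions, both expressed
    relative to the (registered) object centroid, so translation is already removed.
    """
    prev = np.asarray(prev_pts, dtype=float).ravel()
    cur = np.asarray(cur_pts, dtype=float).ravel()
    # Minimize ||cur - s * prev||^2  =>  s = <prev, cur> / <prev, prev>.
    return float(prev @ cur) / float(prev @ prev)

def update_bandwidth(h, s):
    # Bandwidth update of Eq. (5): the tracking window grows or shrinks with the object.
    return s * h

# Example: corners roughly 10% farther from the centroid in the new frame.
prev = np.array([[10.0, 0.0], [0.0, 8.0], [-6.0, -6.0]])
cur = 1.1 * prev + np.random.normal(scale=0.1, size=prev.shape)
s = estimate_scale(prev, cur)
h_new = update_bandwidth(32.0, s)  # e.g. a 32-pixel half-width window
```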
Fig. 3. Tracking results comparison (left to right)
Fig. 4. Tracking windows and corners extracted from them
3 Experimental Results
In Fig. 3 the object is a vehicle that runs towards the camera. The tracking results are shown by white windows. The top row is the result of using the fixed kernel bandwidth initialized in the first frame; observe that the center of the tracking window gradually departs from the object centroid. The middle row shows the tracking result of the method proposed in [5]; observe that the size of the tracking windows
decreases gradually due to its limitation, i.e. any location of a tracking window that is too small will yield a similar value of the Bhattacharyya coefficient. In the bottom row, encouraging results are obtained owing to the backward tracking method, which registers the object center in the current frame so as to evaluate a satisfactory kernel scale using the object affine model.
Fig. 5. Corner correspondences contrast
Fig. 4(a) and (b) are from the last two images in the top row of Fig. 3. The dashed block in (b) is the tracking window with fixed kernel bandwidth, while the unregistered window is the white thin block. (c), (d) and (e) show the corners extracted from the previous tracking window, the unregistered tracking window and the registered tracking window, respectively. The size of the windows in (c) and (d) is extended in order to get more candidate correspondences. In Fig. 5 the corners are placed in one coordinate system with origin at the center of the tracking windows, and the correspondence relationship is represented by a line. Fig. 5(a) shows the correspondences between corners from the previous tracking window (plus) and the unregistered tracking window (dot), while Fig. 5(b) shows the correspondences between corners from the previous tracking window (plus) and corners from the registered window (dot). In contrast to Fig. 5(a), the lines in (b) almost all follow the same orderly trend, which indicates that mismatched pairs are largely eliminated by the centroid registration. We can therefore obtain a more accurate scaling magnitude by traditional regression with the corner correspondences as samples. The new tracking window driven by the updated kernel bandwidth is shown in Fig. 4(b) (white thick block).
4 Conclusion
An automatic kernel-bandwidth selection method is proposed for mean-shift based object tracking. By recovering the object affine model through backward tracking and object centroid registration, the tracker can handle situations in which the object changes its size, especially when the size variations are large.
References
1. Cheng Y.: Mean shift, mode seeking, and clustering. IEEE Trans. Pattern Analysis and Machine Intelligence 8 (1995) 790–799
2. Fukunaga K., Hostetler L.D.: The estimation of the gradient of a density function, with applications in pattern recognition. IEEE Trans. Information Theory 1 (1975) 32–40
3. Yilmaz A., Shafique K., Shah M.: Target tracking in airborne forward looking infrared imagery. Int. J. Image and Vision Computing 7 (2003) 623–635
4. Bradski G.R.: Computer vision face tracking for use in a perceptual user interface. In: IEEE Workshop on Applications of Computer Vision, Princeton, NJ (1998) 214–219
5. Comaniciu D., Ramesh V., Meer P.: Kernel-based object tracking. IEEE Trans. Pattern Analysis and Machine Intelligence 5 (2003) 564–575
6. Collins R.T.: Mean shift blob tracking through scale space. In: Proc. IEEE Conf. Computer Vision and Pattern Recognition 2 (2003) 234–240
7. Hu W., Wang S., Lin R.S., Levinson S.: Tracking of object with SVM regression. In: Proc. IEEE Conf. Computer Vision and Pattern Recognition 2 (2001) 240–245
8. Olson C.F.: Maximum-likelihood template matching. In: Proc. IEEE Conf. Computer Vision and Pattern Recognition 2 (2000) 52–57
Tracking Algorithms Evaluation in Feature Points Image Sequences
Vanessa Robles1, Enrique Alegre2, and Jose M. Sebastian3
1
Dpto. de Ingeniería Eléctrica y Electrónica, Universidad de León, España
2
Dpto. de Ingeniería Eléctrica y Electrónica, Universidad de León, España
[email protected] [email protected] 3
Dpto. de Automática, Ingeniería Electrónica e Informática Industrial, Universidad Politécnica de Madrid, España
[email protected]
Abstract. In this work, different techniques for target tracking in video sequences have been studied. The aim is to decide whether the evaluated algorithms can be used to determine and analyze a particular kind of trajectory. Different feature point tracking algorithms have been implemented; they solve the correspondence problem starting from a detected point set. After carrying out various experiments with synthetic and real points, we present an assessment of the algorithms' results showing their suitability for our problem: boar semen video sequences.
1 Introduction
The present work is part of a research project focused on assessing frozen-thawed boar semen in order to evaluate its cryopreservation and post-thaw fertility. We use image processing techniques to analyze abnormal forms. The first defects studied are head, acrosome and shape features, using geometric and textural methods. The second aspect studied is sperm motility. For this reason, we first need to know each spermatozoon's trajectory. Afterwards, we will obtain track measures such as straight-line velocity, curvilinear velocity, beat cross frequency, amplitude of lateral head displacement, etc.
2 Previous Works
Interest in motion processing has increased with advances in motion analysis methodology and in image and general processing capabilities. The usual input to a motion analysis system is a temporal image sequence. From 1979 to the present day, several authors have provided solutions to the tracking problem. Their algorithms aim to obtain and analyse selected object trajectories in an image sequence. As the correspondence problem is combinatorially explosive,
the researchers have used a number of constraints. These restrictions include maximum velocity, changes in velocity or smoothness of motion, common motion, consistent match, rigidity, etc. Moreover, an important issue in this problem is that the authors try to convert the qualitative heuristic solutions into quantitative expressions, which become the cost functions [10], [9], [5]. Thus, the aim is to search for the trajectory set that minimizes one of these functions. However, it must be pointed out that enumerating all possible sets and picking the one with the least cost is not feasible; the authors therefore need a good approximation algorithm to obtain a suboptimal solution that is very close to the optimum. The main research efforts are the approach of Ullman [12], who proposed a minimal mapping theory for correspondence assuming a probabilistic nature of movement; Jenkin [4], who presented a method for tracking the three-dimensional motion of points from their changing two-dimensional perspective images as viewed by a non-convergent binocular vision system; Barnard and Thompson [11], who proposed an approach based on relaxation; Aggarwal, Davis and Martin [6], who suggested iconic methods and structure-based methods; Sethi and Jain [10], who presented an optimal approximation using an iterative algorithm that assumes uniform movement; Salari and Sethi [13], who presented another iterative algorithm that considers several inter-point events; Rangarajan and Shah [9], who proposed a proximal uniformity constraint based on a non-iterative polynomial algorithm; Chetverikov and Verestóy [3], who suggested a competitive hypothesis-checking correspondence method; and finally Veenman, Reinders and Backer [1], who solved the correspondence problem using statistical and heuristic methods. In this work the algorithms of Sethi and Jain [10], Salari and Sethi [13], Rangarajan and Shah [9], and Chetverikov and Verestóy [2] have been evaluated. They try to solve the following correspondence problem: given n frames at any time instant and m points in each frame, the correspondence between points of consecutive frames is established, such that no two points in a frame can correspond to the same point in the previous frame. The main features of each method are summarised below.
2.1 Sethi and Jain Algorithm
A path coherence function is used by Sethi and Jain [10]. It assumes that the motion of a rigid or non-rigid object cannot change abruptly at any time instant. It supposes that there are m points per frame of a sequence; the points have been obtained with a feature detector or a general corner detector. The trajectories are found by the algorithm: for each trajectory the deviation should be minimal, and the sum of the deviations over all trajectories should also be minimal. The path coherence function is defined in (1),
$$\phi = w_1\!\left(1 - \frac{\vec d_{k-1} \cdot \vec d_k}{\|\vec d_{k-1}\|\,\|\vec d_k\|}\right) + w_2\!\left(1 - \frac{2\sqrt{\|\vec d_{k-1}\|\,\|\vec d_k\|}}{\|\vec d_{k-1}\| + \|\vec d_k\|}\right) \qquad (1)$$
where $w_1$ and $w_2$ are weights, $X_i^k$ denotes the coordinates of point i in frame k, and $\vec d_{k-1}$, $\vec d_k$ are the displacement vectors from the point in frame k-1 to the point in frame k and from the point in frame k to the point in frame k+1. The first term can be considered directional coherence and the second term speed coherence.
The deviation term for a trajectory is defined in (2) as the sum of the path coherence values along the trajectory,
$$D_i = \sum_{k=2}^{n-1} \phi\!\left(X_i^{k-1}, X_i^{k}, X_i^{k+1}\right) \qquad (2)$$
The optimized iterative algorithm proposed is called the "Greedy Exchange Algorithm (GEA)" and is based on greedy algorithms. First, it obtains the initial trajectories using nearest neighbours. Next, the trajectories are refined through an exchange loop that maximizes the increase in smoothness; this value is obtained from (2), starting at the second frame. Finally, the algorithm finishes when no exchanges are left. For the algorithm to work, it is essential that the correspondence for the first two frames be correct, since it is never altered. It also gives rise to wrong results when the displacement of the objects is comparable to their size. To eliminate this kind of error, the authors proposed a Modified Greedy Exchange Algorithm that allows the exchange loop to operate in both directions through a cascade of free loops. According to our experiments, entries, exits and occlusions are not considered, which causes incorrect trajectories. Besides this, we have found that the algorithms fail with a large number of points per frame, and they can fall into infinite loops if the minimum value of the gain is not restricted, or if a previously made exchange is allowed to be repeated.
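A minimal sketch of Eqs. (1) and (2) is given below; the weight values and the handling of stationary points are illustrative assumptions, not values from the paper.

```python
import numpy as np

def path_coherence(p_prev, p_cur, p_next, w1=0.1, w2=0.9):
    """Path coherence phi of Eq. (1) for three consecutive trajectory points.

    The first term penalizes direction changes, the second penalizes speed
    changes. The weights w1, w2 are illustrative values.
    """
    d1 = np.asarray(p_cur, float) - np.asarray(p_prev, float)
    d2 = np.asarray(p_next, float) - np.asarray(p_cur, float)
    n1, n2 = np.linalg.norm(d1), np.linalg.norm(d2)
    if n1 == 0.0 or n2 == 0.0:
        return w1 + w2  # stationary point: maximal penalty under this convention (assumption)
    direction = 1.0 - float(d1 @ d2) / (n1 * n2)
    speed = 1.0 - 2.0 * np.sqrt(n1 * n2) / (n1 + n2)
    return w1 * direction + w2 * speed

def trajectory_deviation(trajectory):
    """Deviation of Eq. (2): sum of phi over every interior point of one trajectory."""
    return sum(path_coherence(trajectory[k - 1], trajectory[k], trajectory[k + 1])
               for k in range(1, len(trajectory) - 1))
```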
2.2 Salari and Sethi Algorithm
Salari and Sethi [13] propose a modified algorithm that uses the path coherence function. Phantom feature points (PFPs) are used as fillers to extend all trajectories over the given frame set. Displacement (3) and local smoothness deviation (4) values are defined for a trajectory that has some PFPs assigned to it.
One constant, $d_{max}$, refers to the maximum distance, so that PFPs always move a fixed distance; another, $\phi_{max}$, refers to the maximum deviation: a penalty is imposed for missing feature points. First, the algorithm obtains the initial trajectories using nearest neighbours limited by $d_{max}$. The incomplete trajectories are extended using PFPs. Next, the exchanges that provide the maximum smoothness increment according to (2), limited by $\phi_{max}$, are made. Finally, the algorithm finishes when no exchanges are left. In the experiments, we have verified that the algorithm is near the optimum solution, as it allows for entries (creating new paths), exits (finishing paths before the
last frame) and occlusions (broken paths). But this algorithm depends on the $d_{max}$ and $\phi_{max}$ limits, so it is efficient only when the limits are adapted to realistic values. In our problem there are unpredictable occasions where the spermatozoon motion is random, and an approximate maximum deviation can yield wrong trajectories. Moreover, the algorithm does not correct broken trajectories, so it does not adapt well to our problem because it reports a higher number of spermatozoa than the real count. We have also verified that this algorithm breaks down when the maximum distance is very small, preventing correspondences between frames, and when there are a large number of points or frames.
2.3 Rangarajan and Shah Algorithm
Rangarajan and Shah [9] try to resolve the correspondence problem through a non-iterative, polynomial-time algorithm that minimizes a cost function called proximal uniformity (5), establishing the correspondence over n frames. In this way, given the position of a point in one frame, its position in the next frame lies in the proximity of its previous position, and the resulting trajectories are smooth and uniform, without abrupt changes of the velocity vector over time. The algorithm gives a reasonable solution, although not necessarily the optimal one. It supposes that the initial correspondence (obtained with the method of Little et al. [7]) and the set of feature points are known beforehand (obtained, for example, with an interest operator). It determines the correspondence between points of consecutive frames by minimizing the proximal uniformity function (5), whose first term represents a relative change in speed (leading to smooth and uniform trajectories) and whose second term denotes a relative displacement (forcing the proximal correspondence). In (5), $X_q^k$ denotes the coordinates of point q in frame k.
The function guarantees smooth changes of speed and direction, small displacements between two frames, and a predisposition for intersecting trajectories, and it avoids falling into local minima. In experiments we have seen that its computation speed is considerably higher than that of both previous methods. Despite its good results, this approach does not adapt well to our problem because of the requirement for an initial correspondence, which is impossible to obtain exactly due to possible occlusions in the first frames.
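The exact form of the proximal uniformity function is not reproduced above, so the sketch below only illustrates one plausible reading of the description (a relative speed-change term plus a relative displacement term); the normalization over the candidate set and the name proximal_uniformity are our assumptions.

```python
import numpy as np

def proximal_uniformity(p_prev, p_cur, candidates):
    """Score each candidate next-frame point for the track (p_prev -> p_cur).

    A sketch of the idea only: term 1 is a relative change of velocity,
    term 2 a relative displacement, each normalized over the candidate set.
    Lower scores are better.
    """
    p_prev, p_cur = np.asarray(p_prev, float), np.asarray(p_cur, float)
    cand = np.asarray(candidates, float)
    v_old = p_cur - p_prev                       # previous displacement
    v_new = cand - p_cur                         # displacement to each candidate
    speed_change = np.linalg.norm(v_new - v_old, axis=1)
    displacement = np.linalg.norm(v_new, axis=1)
    eps = 1e-12                                  # avoid division by zero
    return (speed_change / (speed_change.sum() + eps)
            + displacement / (displacement.sum() + eps))

# The chosen correspondence is the candidate with the smallest score.
scores = proximal_uniformity([0, 0], [2, 1], [[4, 2], [2, 5], [3, 1]])
best = int(np.argmin(scores))
```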
2.4 Chetverikov and Verestóy Algorithm
Chetverikov and Verestóy [2,3] implemented a new tracking algorithm called the "IPAN Tracker" to solve the problem of tracking dense feature point sets with incomplete
trajectories. Starting from the Salari and Sethi [13] algorithm and using (1), they designed an algorithm that calculates the motion trajectories. It operates on sets of three frames and consists of several steps: initialization, processing of the next frame, and post-processing of the broken trajectories. In order to verify its effectiveness for our problem, the implementation proposed by Chetverikov et al. (the IPAN group) through the Point Set Motion Generator [8] has been used. In the results obtained we have observed that it provides a closer approximation than the previous algorithms, although it makes some errors, as discussed in the following section.
3 Experiments
In a first stage we checked the algorithms with synthetic points generated through several functions that simulate spermatozoa motion in an image sequence. A graph with the sample functions is shown in Fig. 1. Ten frames and six points per frame have been considered in the feature point set. In this example we obtained the optimum result with the first two algorithms, as Fig. 2 shows. Although the Salari and Sethi algorithm depends on the $d_{max}$ and $\phi_{max}$ parameters, we obtained the optimum trajectories using the values 8 and 0.1, respectively.
Fig. 1. Motion functions.
Fig. 2. Sethi & Jain and Salari & Sethi Trajectories.
The deviation of each trajectory and the total deviation were computed for this example. We then tried to check the algorithms with real points. In these experiments we took an image sequence of 8 frames, shown in Fig. 3, in which we counted between 15 and 18 spermatozoa. The images were taken at 20x magnification. The sequence frames were extracted and processed to correct the camera interlacing lines; they were then segmented and post-processed to eliminate regions that did not correspond to spermatozoa. Finally the centroid of each valid region was obtained, this being the point on which the tracking was performed. If we observe the sequence in Fig. 3 we can see that some spermatozoa do not move while others move at a moderate speed. We are interested in those that move with rectilinear motion at moderate speed and in those that move at medium speed with little or no rectilinear trajectory.
Fig. 3. Image Sequence
A good result was not obtained when we executed the Sethi and Jain algorithm on this set of points. This is because the number of points in the sequence has to be constant and, to guarantee this restriction, we limited it to the minimum number of points per frame. Correspondences between frames are therefore lost, as reflected in the last frames of Fig. 4, where the number of points is more variable. The Salari and Sethi algorithm, which allows entries, exits and occlusions (giving rise to broken trajectories), obtains a better, though not optimal, approximation. It yields 21 spermatozoa when there are at most 18. If only the spermatozoa trajectories located on the left of Fig. 5 are observed, we see that it obtains a good result for this subset. Its total deviation is better than the previous one.
Fig. 4. Sethi and Jain Trajectories.
Fig. 5. Salari and Sethi Trajectories.
The trajectories obtained by the Rangarajan and Shah algorithm are shown in Fig. 6. Its total deviation is the worst of all in this example. Although the result is correct for the majority of spermatozoa, errors appear in the correspondences between some frames, as shown in T12, T14 and T15. This happens when entries, exits and occlusions appear in the sequence and are not handled, because the number of points per frame is kept constant. In addition to this very unfavourable limitation in our case, the algorithm needs the initial correspondences to be computed beforehand. This data influences
the final result considerably and is difficult to obtain with certainty in our problem. Finally, we checked the Chetverikov and Verestóy [2] algorithm through [8]. Starting from the feature point set obtained from the image sequence of Fig. 3, we obtained the result shown in Fig. 7. As can be observed, it is close to the real situation, but it is important to note that entries, exits and occlusions occur which are not taken into account. In this implementation the number of points in each frame is constant; for this reason it loses information which, in our example, becomes apparent as the loss of a spermatozoon track. The result shows 16 trajectories when there should be 18. Its total deviation is the best of all.
Fig. 6. Rangarajan and Shah Trajectories.
Fig. 7. IPAN Tracker Trajectories.
4 Conclusions
In the experiments we have observed that the algorithms we tested have a high computational complexity, which means a high execution time, and that they break down when there are a large number of points or frames. The best results were obtained with synthetic points generated through several functions, since in that case we work with the same set of points throughout the sequence: there are no entries, exits or occlusions. Correct feature point detection is an important aspect with which we could avoid occlusions or spurious points; in our problem we could establish that such an event has occurred by observing the spermatozoon tails and writing two points instead of one into the feature point set. In the first two algorithm implementations we controlled the infinite loops by limiting the minimum value of the gain to 0.0001: if an exchange yields less than this value, we consider it not significant and the exchange is not allowed. Unlike the other algorithms, the non-iterative algorithm [9] improves the execution speed, but the quality of its results depends on the initial correspondences, which is a hard limitation to overcome in our problem. The algorithm of [2] also improves the speed and permits a large density of points in the sequence. It handled the partial occlusions produced with synthetic points, but occlusion remains an important limitation with real points and in our problem.
Finally, we can conclude that the algorithms work well with smooth and rectilinear motions, but not with random or partially abrupt motions.
5 Future Works
We are currently implementing more recent algorithms such as those of Chetverikov and Verestóy [2,3] or Veenman [1]. They propose stronger methods to join broken trajectories or to predict the corresponding point in a frame through other restrictions. In future work we will try to improve efficiency and speed for a large number of features, and will try to handle entries, exits, occlusions and partial losses. This work was supported by the Comisión Interministerial de Ciencia y Tecnología of the Spanish Government under Project DPI2001-3827-C02-01.
References
1. Veenman, C.J., Reinders, M.J.T., Backer, E.: Resolving motion correspondence for densely moving points. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 23 (January 2001) 54–72
2. Chetverikov, D., Verestóy, J.: Feature point tracking for incomplete trajectories. Image and Pattern Analysis Group, Budapest (1999)
3. Chetverikov, D., Verestóy, J.: Tracking feature points: a new algorithm. In: Proc. International Conf. on Pattern Recognition (1998) 1436–1438
4. Sánchez Nielsen, E., Hernández Tejera, M.: Seguimiento de Objetos Móviles usando la Distancia de Hausdorff. Departamento de Estadística, Investigación Operativa y Computación, Universidad de La Laguna, Tenerife (2000)
5. Shaw, G.L., Ramachandran, V.S.: Interpolation during apparent motion. Perception, vol. 11 (1982) 491–494
6. Aggarwal, J.K., Davis, L.S., Martin, W.N.: Correspondence processes in dynamic scene analysis (1981)
7. Little, J.J., Bulthoff, H.H., Poggio, T.: Parallel optical flow using local voting. In: Proc. Second ICCV (1988)
8. Verestóy, J., Chetverikov, D.: Feature Point Tracking Algorithm. Image and Pattern Analysis Group, Budapest (1998). http://visual.ipan.sztaki.hu/psmweb/index.html
9. Rangarajan, K., Shah, M.: Establishing motion correspondence. CVGIP: Image Understanding (1991) 56–73
10. Sethi, I.K., Jain, R.: Finding trajectories of feature points in a monocular image sequence. IEEE Trans. Pattern Analysis and Machine Intelligence, vol. PAMI-9, no. 1 (January 1987) 56–73
11. Barnard, S.T., Thompson, W.B.: Disparity analysis of images. IEEE Trans. Pattern Anal. Mach. Intell., vol. PAMI-2 (1980) 333–340
12. Ullman, S.: The Interpretation of Visual Motion. MIT Press, Cambridge (1979)
13. Salari, V., Sethi, I.K.: Feature point correspondence in the presence of occlusion. IEEE Trans. Pattern Analysis and Machine Intelligence (1990) 87–91
Short-Term Memory-Based Object Tracking
Hang-Bong Kang and Sang-Hyun Cho
Dept. of Computer Engineering, The Catholic University of Korea, #43-1 Yokkok 2-dong, Wonmi-Gu, Puchon City, Kyonggi-Do, Korea
[email protected]
Abstract. In this paper, we propose a new tracking method that adapts itself to suddenly changing appearance. The proposed method is based on color-based particle filtering. A short-term memory model is introduced to handle the cases of sudden appearance changes, occlusion, disappearance and reappearance of tracked objects. A new target model update method is implemented. Our method is robust and versatile for a modest computational cost. Desirable tracking results are obtained. Keywords: object-tracking, color-based particle filtering, short-term memory, appearance changes.
1 Introduction
Tracking human subjects plays an important role in video surveillance and monitoring systems. In particular, real-time tracking of non-rigid objects such as multiple faces can be a challenging task. The object tracking algorithm should be computationally efficient and robust to occlusion and to changes in 3D pose and scale. Various object tracking algorithms have been developed. Comaniciu et al. [1,2] proposed the mean shift tracker, a non-parametric density gradient estimator based on color distributions; the method can reliably track objects with partial occlusions. Isard et al. [3,4] used color information based on Gaussian mixture models. Nummiaro et al. [5,6] proposed an adaptive color-based particle filter by extending the CONDENSATION algorithm [3]; this method shows good performance in comparison with the mean shift tracker and Kalman filtering [7]. However, limitations still exist in dealing with sudden appearance changes. To resolve this problem, appearance-model-based tracking [8] has been proposed, but it has problems with occlusions. To deal with sudden changes in 3D pose and with occlusions, we propose a new object tracking method based on the color-based particle filter [5,6]. If a new appearance of a tracked object is observed, we store the appearance model in the short-term memory. After that, the appearance models in the memory are referenced whenever an estimated object state needs to be determined. The novelty of the proposed approach mainly lies in its adaptation to sudden appearance changes.
The paper is organized as follows. Section 2 discusses the memory-based appearance representation scheme. Section 3 presents our proposed tracking method. Section 4 shows experimental results of our method in object tracking.
2 Memory-Based Appearance Representation In this Section, we will present our memory-based appearance representation method to handle the cases of sudden appearance changes or occlusions.
2.1 Appearance Model
If there are significant differences between the target and the objects in observation, tracking might not be possible. These differences might occur due to changes in illumination, pose, or occlusion between the moment when the target model was built and the moment when the observation was made. Figure 1 shows one such instance: as the human subject turns the corner and walks to the vending machine, the local appearance of the face changes rapidly. In this situation, the usual template-matching tracking methods such as CONDENSATION (or the color-based particle filter) [3-5] may not track the object well because they do not keep track of suitable appearance models. Even though the target model can be updated adaptively [6], this cannot handle sudden appearance changes. It is therefore necessary to maintain the different appearance models observed while tracking the object. Two things should be considered in maintaining appearance models: one is deciding when a new appearance model should be added to the target model list, and the other is deciding how long a model should be maintained. People can usually track objects well based on their memory or experience, regardless of occlusions or sudden appearance changes. We therefore construct a model similar to people's short-term memory in order to deal with various appearance models.
Fig. 1. Sudden appearance changes
2.2 Short-Term Memory Model To handle the cases of sudden appearance changes and occlusions in tracking, we introduce a short-term memory-based method in updating target model. Figure 2 shows a causal FIFO short-term memory based model. The model consists of a memory buffer and an attention span. The memory buffer keeps the old reference (or
target) models and the attention span maintains a number of recent reference models. When a new reference model enters the short-term memory, the oldest model is deleted from the memory. The size of the memory buffer depends on how long old reference models are kept. At the initialization step, a copy of the target model used in the tracking process enters the memory as a reference model and is placed in the attention span. Using particle filtering [6], we estimate the next object state as the mean state of the object. First, we compute the similarity between the estimated state and the current target model. If the similarity value is larger than a threshold, the target model is updated based on the estimated state and the current target model, as shown in Figure 3(a). Otherwise, if the similarity value is larger than another threshold, we search for similar reference models in the attention span of the memory. If the estimated state differs from the current target model but existed before the tracking process began, a related reference model may be in the attention span; in this case, we can still track the designated object effectively. If the similarity between the estimated state and each of the reference models in the attention span is less than the threshold, a large appearance change has occurred. Then the estimated state enters the memory as a reference model and is also used as the current target model (see Figure 3(c)). The tracking process therefore continues even when abrupt appearance changes occur.
Fig. 2. Short-term Memory Model
When occlusion or disappearance of the tracked object occurs, the difference between the estimated state and each of the recent reference models is larger than the threshold, just as in the case of abrupt appearance changes. To distinguish these two cases from each other, we compute the frame differences to see whether the object has disappeared, because object motion is not detected when the tracked object disappears from the view of a stationary camera. If the frame difference value is smaller than the threshold, we decide that the object has disappeared, and the estimated state does not enter the short-term memory because it is not a new appearance model. If a robust tracking system is desired regardless of computational cost, the search area for related reference models can be extended to the whole memory buffer instead of just the attention span. By using a short-term memory model, we keep the latest appearance models throughout the tracking process.
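The following sketch illustrates one way such a causal FIFO buffer with an attention span could be organized; the class name and the representation of a reference model as a plain histogram are our assumptions, not code from the paper.

```python
from collections import deque

class ShortTermMemory:
    """FIFO buffer of reference (target) models with a small attention span.

    A minimal sketch: reference models are stored as plain histograms; the most
    recent `span` entries form the attention span searched first during tracking.
    """

    def __init__(self, buffer_size=10, span=3):
        self.buffer = deque(maxlen=buffer_size)  # oldest model dropped automatically
        self.span = span

    def add(self, reference_model):
        self.buffer.append(reference_model)

    def attention_span(self):
        # The `span` most recently stored reference models.
        return list(self.buffer)[-self.span:]

# Initialization step: the initial target model enters the memory.
memory = ShortTermMemory(buffer_size=10, span=3)
initial_target_histogram = [0.2, 0.5, 0.3]   # toy 3-bin histogram
memory.add(initial_target_histogram)
```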
3 Short-Term Memory-Based Tracking In this Section, we will present how to implement our proposed tracking system in detail. Our tracking system is based on color-based particle filter [6] and is extended to keep different appearance models in the short-term memory, so that it can adapt its tracking process in the cases of sudden appearance changes and occlusions of objects.
3.1 Color-Based Particle Filter
We define the sample state vector as $s = \{x, y, H_x, H_y, k\}$, where x, y designate the location of the ellipse, $H_x$, $H_y$ the lengths of its half axes, and k the corresponding scale change. The dynamic model can be represented as
$$s_t = A\, s_{t-1} + w_{t-1}$$
where A defines the deterministic component of the model and $w_{t-1}$ is a multivariate Gaussian random variable.
Fig. 3. Memory-Based Update Method
As target models, we use color distributions, because they achieve robustness against non-rigidity, rotation and partial occlusion. The color distribution at location y is calculated as
$$p_y(u) = f \sum_{i=1}^{I} w\!\left(\frac{\left\| y - x_i \right\|}{a}\right) \delta\!\left[ h(x_i) - u \right]$$
where I is the number of pixels in the region, $\delta$ is the Kronecker delta function, a is used to adapt the size of the ellipse, f is a normalization factor, and w is the weighting function, which assigns smaller weights to pixels further away from the region center. As in the color-based particle filter [6], the tracker selects the samples from the sample distribution of the previous frame and predicts new sample positions in the current frame. After that, it measures the observation weights of the predicted samples. The weights are computed using a Gaussian with variance $\sigma$,
$$\pi^{(n)} = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{d^2}{2\sigma^2}}$$
where d is the Bhattacharyya distance. The Bhattacharyya distance is computed from the Bhattacharyya coefficient, a popular measure of similarity between two distributions p(u) and q(u). Considering discrete densities such as our color histograms, the coefficient is defined as
$$\rho[p, q] = \sum_{u=1}^{m} \sqrt{p(u)\, q(u)}$$
and the Bhattacharyya distance is
$$d = \sqrt{1 - \rho[p, q]}.$$
The estimated state at each time step is computed as the weighted mean of the samples,
$$E[S] = \sum_{n=1}^{N} \pi^{(n)} s^{(n)}.$$
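A compact sketch of these observation-weight formulas is given below; the histogram inputs and the value of sigma are illustrative.

```python
import numpy as np

def bhattacharyya_distance(p, q):
    """Bhattacharyya distance between two discrete color histograms."""
    p = np.asarray(p, float)
    q = np.asarray(q, float)
    rho = np.sum(np.sqrt(p * q))           # Bhattacharyya coefficient
    return np.sqrt(max(0.0, 1.0 - rho))

def observation_weight(p_candidate, q_target, sigma=0.1):
    """Gaussian weight of one sample given its candidate histogram."""
    d = bhattacharyya_distance(p_candidate, q_target)
    return np.exp(-d ** 2 / (2.0 * sigma ** 2)) / (np.sqrt(2.0 * np.pi) * sigma)

# Weighted mean state E[S] over N samples (toy example with 2-D states).
states = np.array([[10.0, 5.0], [12.0, 6.0], [11.0, 5.5]])
weights = np.array([observation_weight([0.3, 0.7], [0.25, 0.75]),
                    observation_weight([0.5, 0.5], [0.25, 0.75]),
                    observation_weight([0.28, 0.72], [0.25, 0.75])])
weights /= weights.sum()
estimated_state = weights @ states
```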
3.2 Short-Term Memory-Based Target Model Updating
To update the target model, we evaluate the target update condition $\rho\!\left[p_{E(S)}, q\right] > \pi_T$, where $\rho\!\left[p_{E(S)}, q\right]$ is the similarity between the estimated state E(S) and the target model q, and $\pi_T$ is the target update threshold value.
If this condition is satisfied, the update of the target model is performed by
$$q_t(u) = (1 - \alpha)\, q_{t-1}(u) + \alpha\, p_{E(S_t)}(u),$$
where $\alpha$ weights the contribution of the estimated state histogram. Otherwise, we compute the similarity between the estimated state E(S) and each of the reference models in the attention span using the Bhattacharyya distance, and use the update-reference-model condition that this similarity remains below a threshold, $\rho\!\left[p_{E(S)}, q_r\right] < \pi_M$,
where $\pi_M$ is the memory update threshold. If the update-reference-model condition is satisfied, the estimated state E(S) enters the memory as a reference model and the current target model is replaced by E(S). Otherwise, the updated target model is constructed from the related reference model in the memory attention span and the estimated state model. Figure 4 shows our proposed target model update algorithm. To track multiple objects with our short-term memory model, we maintain one short-term memory, with its own appearance models, for each object. When objects merge into one and later separate again, our model can track them effectively, because the current target model in our memory-based tracking is better than that of the color-based particle filter method.
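Put together, the update logic just described can be sketched as follows; it reuses the ShortTermMemory buffer sketched in Section 2.2 above, and the threshold values and helper names are illustrative assumptions.

```python
def update_target_model(q_target, estimate_hist, memory, alpha=0.4, pi_t=0.9, pi_m=0.8):
    """One update step of the target model using the short-term memory (sketch).

    q_target: current target histogram; estimate_hist: histogram at the estimated
    state E(S); memory: a ShortTermMemory of reference histograms.
    """
    rho = lambda p, q: sum((pi * qi) ** 0.5 for pi, qi in zip(p, q))  # Bhattacharyya coefficient

    if rho(estimate_hist, q_target) > pi_t:
        # Ordinary adaptive update of the current target model.
        return [(1 - alpha) * q + alpha * p for p, q in zip(estimate_hist, q_target)]

    span = memory.attention_span()
    similarities = [rho(estimate_hist, ref) for ref in span]
    if not span or max(similarities) < pi_m:
        # Large appearance change: store E(S) as a new reference model and switch to it.
        memory.add(list(estimate_hist))
        return list(estimate_hist)

    # Otherwise blend the best-matching reference model with the estimated state.
    best_ref = span[similarities.index(max(similarities))]
    return [(1 - alpha) * q + alpha * p for p, q in zip(estimate_hist, best_ref)]
```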
Fig. 4. Target Model Update Algorithm
4 Experimental Results
Our proposed short-term memory-based tracking algorithm is implemented on a Pentium 4 1.5 GHz system. We carried out several experiments in a variety of environments to show the robustness of our proposed method.
Fig. 5. Tracking Results: (a) color-based particle filter, (b) short-term memory-based tracking
Fig. 6. Tracking Results: (a) color-based particle filter, (b) short-term memory-based tracking
Fig. 7. Multiple object tracking result
To illustrate the differences between the color-based particle filter [6] and our proposed algorithm, we used two sequences. One is the vending machine sequence, in which a person goes to the vending machine and then returns to his office. The other is the corridor sequence, in which a person turns around the corner and disappears; shortly afterwards, he reappears at the original place. Figure 5 shows the tracking results on the vending machine sequence using the two methods. In Figure 5(a) the color-based particle filter method [6] was used, and the ellipse was tracked incorrectly because of sudden appearance changes; our proposed algorithm, shown in Figure 5(b), worked well in this case. The update weight of the target model was 0.4, the threshold value was set to 0.9, and the size of the attention span of the memory was 3. Figure 6 shows the tracking results on the corridor sequence: Figure 6(a) shows the result of the color-based particle filter method [6] and Figure 6(b) the result of our proposed algorithm. In Figure 6(a) the tracking failed when the person returned, whereas our tracking method handles this kind of situation well, as shown in Figure 6(b). We also tested multiple object tracking with our method; as shown in Figure 7, the tracking result shows good performance in this case as well.
5 Conclusions
In this paper, we proposed a novel tracking method to handle sudden appearance changes and occlusions. A short-term memory model is proposed to keep different appearance models during the tracking process, and new update methods are designed. We performed a variety of non-rigid object tracking experiments, and the proposed system showed strong robustness even when sudden appearance changes occurred. Compared with other algorithms, our proposed system shows better and more robust tracking performance. The proposed memory model can be extended to handle the tracking of multiple humans for intelligent video surveillance systems.
Acknowledgements. This work was supported by the Catholic University of Korea Research Fund 2003.
References
1. Comaniciu, D., Meer, P.: Real-time tracking of non-rigid objects using mean shift. In: Proc. IEEE Conf. Computer Vision and Pattern Recognition, vol. II, pp. 142–149, June 2000
2. Comaniciu, D., Ramesh, V., Meer, P.: Kernel-based tracking. IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 25, no. 5, pp. 564–577 (2003)
3. Isard, M., Blake, A.: Contour tracking by stochastic propagation of conditional density. In: European Conference on Computer Vision, pp. 343–356 (1998)
4. Isard, M., Blake, A.: CONDENSATION – conditional density propagation for visual tracking. International Journal of Computer Vision 29(1), pp. 5–28 (1998)
5. Nummiaro, K., Koller-Meier, E., Van Gool, L.: A color-based particle filter. In: First International Workshop on Generative-Model-Based Vision, in conjunction with ECCV'02, pp. 53–60 (2002)
6. Nummiaro, K., Koller-Meier, E., Van Gool, L.: Object tracking with an adaptive color-based particle filter. In: Symposium for Pattern Recognition of the DAGM, pp. 353–360 (2002)
7. Kalman, R.: A new approach to linear filtering and prediction problems. Transactions of the ASME, Series D, Journal of Basic Engineering 82(1), pp. 34–45 (1960)
8. Jepson, A., Fleet, D., El-Maraghi, T.: Robust online appearance models for visual tracking. IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 25, no. 10, pp. 1296–1311 (2003)
Real Time Multiple Object Tracking Based on Active Contours
Sébastien Lefèvre1 and Nicole Vincent2
1
LSIIT – University Louis Pasteur (Strasbourg I), Parc d'Innovation, boulevard Brant, BP 10413, 67412 Illkirch Cedex, France
lefevre@lsiit.u-strasbg.fr
2
CRIP5 – University René Descartes (Paris V), 45, rue des Saints Pères, 75270 Paris Cedex 06, France
nicole.vincent@math-info.univ-paris5.fr
Abstract. In this paper our purpose is to present some solutions for multiple object tracking in an image sequence under a real-time constraint and with a possibly mobile camera. We propose to use active contour (or snake) modelling. Classical active contours fail to track several objects at once, so occlusion problems are difficult to solve. The model proposed here enables topology changes for the objects concerned: a merging phase and a splitting phase are performed, respectively, when two objects become close together or move apart. Moreover, these topology changes help the tracking method increase its robustness to noise characterized by high gradient values. In the process we have elaborated, neither preprocessing nor motion estimation (both time-consuming tasks) is required. The tracking is performed in two steps, active contour initialisation and deformation, and supports non-rigid objects in colour video sequences from a mobile camera. In order to take advantage of compressed formats and to speed up the process when possible, a multiresolution framework is proposed, working on the lowest-resolution frame subject to a quality criterion that ensures satisfactory results. The proposed method has been validated in the context of real-time tracking of players in soccer game TV broadcasts. The player positions obtained can then be used in a real-time analysis tool for soccer video sequences.
1 Introduction
Object tracking is a key step in the automatic understanding of video sequence content. When objects are non-rigid, an appropriate tracking method should be used. Among the methods that have been proposed, we can mention deformable models and active contours (or snakes). As we are focusing on approaches characterized by a low computational cost, we choose active contours. Different active contour models have been proposed since the original model by Kass et al. [1], called snakes. This model has shown several limitations such as initialisation, optimal parameter setting, computational cost, and inability to change its topology. Some authors have proposed other models, among them geodesic contours [2], which can deal with topology changes. As we are focusing on real-time tracking, we do not consider approaches such as those based on level sets [3], which are more powerful but also have a higher computational cost. However, executing snakes in a real-time framework is still a challenge.
The method we propose here is based on snakes, but the original limitations are avoided. To do so, the model considers some original energies and a two-step tracking procedure for every frame. Moreover, contrary to other approaches, the method does not require any preprocessing, motion estimation, or compensation. Several optimisations help to achieve real-time processing. After an introduction to the snake model, we specify the energies and describe the two-step tracking algorithm. Then we present the merging and splitting steps, which let the snake change its topology and make multiple object tracking possible. The multiresolution framework is also described. Finally, we illustrate our contribution with results obtained on soccer videos.
2 Active Contours
Here we recall the main active contour models and give a short review of object tracking methods based on active contours. An active contour can be represented as a curve, closed or not, evolving through time. The curve is deformed in order to minimize an energy function
$$E(V) = E_{int}(V) + E_{ext}(V) \qquad (1)$$
where $E_{int}$ and $E_{ext}$ represent the internal and external energies, themselves defined as combinations of energies. A description of the most common energies is given in [4]. Several active contour implementations have been proposed: variational calculus, dynamic programming, and the greedy algorithm. It has been shown [5] that the greedy algorithm [6] is 10 to 80 times faster than the other methods, so we focus on this approach. In the discrete domain, the snake energy function is defined as
$$E(V) = \sum_{i=1}^{n} \left( E_{int}(v_i) + E_{ext}(v_i) \right) \qquad (2)$$
Here V denotes the discrete active contour and $v_i$ its i-th point from a set of n points. The contour V evolves iteratively, with $V^{(t)}$ representing the active contour at iteration t. A minimal-energy point is selected in the neighbourhood of the current point and a move is performed. This iterative deformation process is applied to each point of the curve until convergence occurs. In this section we have presented the main principles of active contours; we now describe the tracking algorithm we propose and the energies it relies on.
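As an illustration of the greedy scheme just outlined, the following sketch performs one deformation pass; the energy callback and the 3x3 neighbourhood are assumptions of the sketch, not details taken from the paper.

```python
import numpy as np

def greedy_pass(points, energy, neighbourhood=1):
    """One greedy deformation pass over a discrete snake (illustrative sketch).

    points: (n, 2) integer array of contour points (row, col).
    energy: callable(point, index, points) -> float, the combined internal
            plus external energy of placing point `index` at `point`.
    Each point moves to the lowest-energy position in its (2k+1)^2 neighbourhood.
    Returns the updated points and whether any point moved.
    """
    moved = False
    pts = points.copy()
    for i in range(len(pts)):
        best, best_e = pts[i], energy(pts[i], i, pts)
        for dr in range(-neighbourhood, neighbourhood + 1):
            for dc in range(-neighbourhood, neighbourhood + 1):
                cand = pts[i] + np.array([dr, dc])
                e = energy(cand, i, pts)
                if e < best_e:
                    best, best_e = cand, e
        if not np.array_equal(best, pts[i]):
            pts[i], moved = best, True
    return pts, moved

# Deformation loop: repeat greedy passes until the contour no longer changes.
# (The energy callback would combine the continuity, curvature, balloon,
#  gradient and colour terms described in Section 3.1.)
```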
3 Tracking Method and Real Time Constraint The goal of the presented method is to track in real time non-rigid objects in colour frames acquired with a moving camera. In order to minimise the computation time, we had to take several decisions which differentiate our contribution from existing methods. First, we have decided to perform active contour deformation without any preprocessing.
608
S. Lefèvre and N. Vincent
Second, no camera motion compensation is performed. Moreover, we do not estimate the motion of the different tracked objects. Finally, gradient computation is limited to an area around the initial position of the tracked object. We will now briefly present the energies used in our model and the two-step tracking method.
3.1 Energies Definition
We consider continuity, balloon, and curvature energies, which are very common internal energies. The external energy links the active contour to the image content, and is also composed of several energies. Here we consider two energies based on gradient and colour information. The first tends to fit the contour to the real edges of objects; we estimate the gradient of a colour image as the sum of the gradients computed on the different colour components using the Sobel operator. The second external force makes the snake stay on the borders of the tracked object. To do so, it is defined using a priori information on the background colour features. In the case of a homogeneous background, it is possible to compute the background average colour; the energy is then defined at every point of the image as the difference between the colour of the considered point and the background average colour. In order to limit the sensitivity to noise, the value obtained is thresholded. If the background is not homogeneous, and the camera is static, it is possible to use a reference frame for the pixel colour comparison.
3.2 Two Step Tracking Method
The tracking method is composed of two steps performed successively on every frame: first the snake is initialised using the result from the previous frame, then it is deformed. The first step is the initialisation of the snake on the current frame. We enlarge a rectangle, with borders parallel to the image borders, surrounding the final snake obtained in the previous frame, so that it a priori includes the contour to be obtained in the current frame, and we set the snake points regularly on the rectangle contour. A single object can then be tracked using the forces described previously: its position at time t is computed from its previous position. Here the balloon force is used to help the snake retract rather than expand. Note that no motion estimation of the tracked object is required. On the first frame of a video sequence, the initial rectangle is obtained from a background/foreground segmentation process [7]. In order to track several objects in real time, we have introduced several improvements using snake splitting and merging.
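A minimal sketch of this initialisation step is given below; the margin value and the point count are illustrative choices, not parameters from the paper.

```python
import numpy as np

def init_snake_from_previous(prev_points, margin=10, n_points=40):
    """Initialise the snake on the current frame (sketch of the first step).

    prev_points: (n, 2) array of the final snake of the previous frame.
    Builds an axis-aligned rectangle enclosing it, enlarged by `margin` pixels,
    and places `n_points` points regularly along the rectangle contour.
    """
    r0, c0 = prev_points.min(axis=0) - margin
    r1, c1 = prev_points.max(axis=0) + margin
    # Walk the rectangle perimeter and sample it regularly.
    top = [(r0, c) for c in np.linspace(c0, c1, n_points // 4, endpoint=False)]
    right = [(r, c1) for r in np.linspace(r0, r1, n_points // 4, endpoint=False)]
    bottom = [(r1, c) for c in np.linspace(c1, c0, n_points // 4, endpoint=False)]
    left = [(r, c0) for r in np.linspace(r1, r0, n_points // 4, endpoint=False)]
    return np.array(top + right + bottom + left)
```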
4 Multiple Object Tracking
The tracking method described here works even in the case of a moving camera. However, tracking may fail if several moving objects have close spatial positions; more precisely, when tracking one object, the process fails if the enlarged initialisation rectangle also covers another object: the initial snake (rectangle) will then include both objects (Figure 1). After a brief review of existing approaches that consider multiple objects, we will describe our solution more precisely. In order to deal with multiple objects in the video sequence, it is necessary to give the model the ability to change its topology.
Fig. 1. Tracking failure with constant topology (left). Main steps in the splitting process (right).
4.1 Some Approaches Allowing Topology Changes Cheung et al. [8] distinguish between methods with explicit use of a split-and-merge technique and those based on a topology-free representation such as level sets. In the T-snakes [9] of McInerney and Terzopoulos, a binary grid is associated with the image; from the grid points they determine the local positions of the topology changes to be performed. Velasco and Marroquin [10] initialise the snakes from pixels characterized by the highest gradient values; snakes are merged if their points have close positions. Ji and Yan [11] study the loops present in the contour. The procedure introduced by Choi et al. in [12] compares energies with a threshold at every iteration. Perera et al. [13] check the topological validity of the snake at every iteration. Delingette and Montagnat proposed in [14] to study crossings between two contours, and then to apply some topological operations to merge two snakes or to create new ones.
4.2 Justifications Let us formalise the problem to be solved here. Consider the shape of interest in the frame; it is tracked by a snake. When an occlusion phenomenon occurs, this shape actually represents two objects. Let us consider that the occlusion has finished at a given time: we are then in the presence of several shapes. We limit ourselves to the case of two disjoint shapes; nevertheless, the same arguments hold when more than two shapes are present. The properties of these two shapes can be stated as:
However, the snake, without this information, still models a single shape (see figure 1). As we are now in the presence of two shapes, we have to define two appropriate snakes. The problem to be solved can then be expressed as the search for a transform T which splits one snake into two snakes modelling the two shapes respectively. In the same way, a merging can be performed to gather several snakes into a unique one if necessary. However, equation (5) is not an equality, so some parts of the original snake may be associated with neither of the two shapes: indeed, the initial shape may contain background in between the two disjoint shapes. At the end of the splitting process, some contours may therefore model shapes of no interest, and they must not be taken into account. To do so, we have to identify some features of the contours so as to be able to take a decision.
4.3 Principle of Topology Change From the previous formalism, several additional steps are necessary in the tracking algorithm: a splitting step, a decision step which keeps only the contours of interest, and a merging step. In order to limit the computation time, these steps are performed only once per frame, when convergence has been achieved with the active contour algorithm described previously. The main steps of the splitting process are illustrated in figure 1. The goal of splitting is to divide the snake into several contours. From equation (2), the energy obtained after convergence is a minimum; as we are using a discrete and local approach, the energy at each of the snake points is a local optimum. As we will see further, we want to give the same importance to internal and external energies. The external energies, however, do not always reach a minimum: they have been thresholded in order to increase robustness to noise, and can therefore be uniformly equal to zero in some areas. After the snake has converged, some points may be trapped in these areas. We therefore propose to delete these points and to split the active contour at the positions of these incorrect points. From the list of remaining points, each sequence of successive points is used to create a new closed contour. The splitting step thus defines several new contours from an initial snake. But this set of new contours may contain snakes which fit pixels corresponding to noise or background, so it is followed by a decision step whose goal is to determine the contours of interest. Size and shape of the new potential contours are involved in the criterion we define to test the pertinence of a contour V: the area delimited by the contour, noted area(V), must be neither too small nor too large, and both the width and the height of the circumscribed rectangle must not be too small. The splitting process described previously requires the definition of a corresponding merging process. This merging step is performed when two objects (each of them being tracked by a snake) become closer until an occlusion phenomenon occurs; in this case, a unique snake has to be used to model these objects. The merging process is therefore launched when two snakes have close centres of gravity. We have described here how splitting / decision / merging steps allow the active contour to deal with topology changes, increase tracking robustness, and ensure a simultaneous tracking of several objects. In the following section, we will show how a multiresolution analysis of the video sequence frames can be performed to limit the computation time of the tracking algorithm.
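A sketch of the splitting and decision steps, assuming the per-point external energy is available; the size bounds are illustrative assumptions and the wrap-around of the closed contour is ignored for brevity.

import numpy as np

def split_snake(points, ext_energy):
    """Delete points whose external energy is zero (trapped in thresholded
    areas) and split the contour into runs of consecutive surviving points."""
    keep = ext_energy > 0
    contours, run = [], []
    for p, ok in zip(points, keep):
        if ok:
            run.append(p)
        elif run:
            contours.append(np.array(run))
            run = []
    if run:
        contours.append(np.array(run))
    return contours

def is_pertinent(contour, min_area=50, max_area=20000, min_side=5):
    """Decision step: keep a contour only if its bounding box is reasonable."""
    ymin, xmin = contour.min(axis=0)
    ymax, xmax = contour.max(axis=0)
    h, w = ymax - ymin, xmax - xmin
    return min_area <= h * w <= max_area and h >= min_side and w >= min_side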
5 Multiresolution Analysis In order to limit computation time, we propose to adapt our original snake model so as to analyse video frames through a coarse-to-fine multiresolution framework. The multiresolution analysis does not necessarily proceed down to the original resolution: it stops as soon as a criterion is verified. Moreover, we automatically adapt some model parameters to the resolution.
5.1 An Incomplete Multiresolution Process Several authors have proposed modelling active contours through a multiresolution framework (e.g. [15]). Snake evolution is then performed in a coarse-to-fine manner. The snake
is first deformed at a low resolution, then the result obtained is used as the initial snake which is deformed at the next finer resolution. This process is repeated until the original resolution is reached. Here the multiresolution framework considers the image instead of the snake. Every frame is analysed at different resolutions, starting from the lowest one. If the analysis does not allow a correct final contour to be obtained according to the decision criterion, the image is analysed at the next finer resolution; the size of the image thus increases in an exponential way. The definition and use of a stopping criterion linked to the quality of the results limit the number of resolutions analysed, and hence the computation time. This choice is particularly interesting when the contour obtained at a low resolution is sufficient to perform the tracking task correctly. The algorithm proposed here is able to process images at different resolutions, from the original resolution down to the lowest one. Most often, the tracking is performed correctly all along the video sequence on strongly reduced images.
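A sketch of this incomplete coarse-to-fine analysis. The per-level tracker and the decision criterion are passed in as callables, and simple subsampling stands in for the successive averaging used in the paper; the number of levels is an assumption.

def track_multiresolution(frame, prev_points, track_on_image, is_pertinent,
                          n_levels=4):
    """Analyse the frame from the coarsest level upwards and stop as soon
    as the decision criterion accepts the contour obtained."""
    points = prev_points
    for level in range(n_levels - 1, -1, -1):        # coarsest level first
        scale = 2 ** level
        small = frame[::scale, ::scale]              # crude stand-in for block averaging
        points = track_on_image(small, prev_points / scale)
        if is_pertinent(points):
            return points * scale                    # back to full-image coordinates
    return points * scale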
5.2 Parameter Robustness Towards Resolution Changes In order to ensure robustness of the algorithm towards resolution changes, we made some parameters depend on the image resolution, whereas the energy coefficients and the neighbourhood size do not depend on the resolution level. The size of the initial rectangle for the snake must obviously not be constant, as a resolution decrease implies a size decrease of the objects present in the image. Starting from the size of the rectangle at the original resolution, the rectangle used at a lower resolution is obtained by scaling it down by the same ratio as the image.
The same scaling can be applied to the number of points belonging to the snake, which also depends on the number of image pixels. As the neighbourhood size is constant whereas the number of image pixels varies, the deformation process converges more or less quickly. Finally, the gradient computation is not resolution independent: the successive averaging of pixels results in an image smoothing, so the gradient threshold has to be adapted to the resolution.
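A minimal sketch of how the resolution-dependent parameters could be adapted, under the assumption (not stated explicitly in the paper) that sizes scale linearly with the downsampling ratio and that the gradient threshold is reduced in the same proportion; these scaling laws are illustrative only.

def scale_parameters(rect_size, n_points, grad_threshold, level):
    """Adapt size-dependent parameters to resolution level `level`
    (level 0 = original resolution, each level halves width and height)."""
    ratio = 2 ** level
    scaled_rect = (rect_size[0] // ratio, rect_size[1] // ratio)
    scaled_points = max(8, n_points // ratio)
    scaled_threshold = grad_threshold / ratio
    return scaled_rect, scaled_points, scaled_threshold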
6 Results and Discussion We have introduced different improvements which help us deal correctly with topology changes and limit the computation time using a multiresolution analysis of the video frames. In this section, we first indicate the different parameters and explain how they can be set efficiently. Then we present some results obtained with these parameters on soccer video sequences. The proposed method has been tested on outdoor scene video sequences characterized by a relatively uniform background. The size of the colour images is 384 × 284 pixels and the acquisition frame rate is 15 Hz. The snake is initially composed of
Fig. 2. Interest of splitting/merging steps in the case of close objects and temporary occlusions.
Fig. 3. Non-rigid object tracking at a resolution 256 times lower than the original one.
points at the original resolution. This parameter has a direct influence on both result quality and computation time; when the application requires only the object position, the number of points can be decreased. At the initial resolution the number of iterations is set to 30, although the contour usually converges earlier. The coefficients used to weight the different energies have all been set to 1, which limits the number of operations (multiplications) and greatly simplifies parameter setting. The threshold used in the gradient computation has been set to 500 at the original resolution; it is compared with the sum of gradient moduli computed on the colour channels with the Sobel operator. In this context, the computation time required on a 1.7 GHz PC is about 35 milliseconds per frame. Figures 2 and 3 illustrate the tracking of non-rigid objects (soccer players) during a video sequence. The algorithm enables us to track a moving object in a moving environment, without object motion estimation or camera motion compensation. Figure 2 illustrates the principle of the splitting and merging steps, which make it possible to track independently the different objects present in the scene. However, the sensitivity of the active contour model to a complex background (containing pixels with high gradient values) remains high. The multiresolution analysis described in the previous section is illustrated in figure 3. The resolution used leads to an image size 256 times smaller than the original one; we can observe the resulting lack of precision in the snake shape.
7 Conclusion In this article, we dealt with the problem of non-rigid object tracking using snakes. Our tracking method can be performed in real time on colour video sequences acquired with a moving camera. The method has been validated on TV broadcasts of soccer games. In order to limit the sensitivity of the model to initialisation settings, our original approach initialises a rectangular snake and then reduces it around the object, so the tracking is robust to initialisation conditions. In order to deal with topology changes, we
have introduced a splitting process, which allows different objects to be tracked. Finally, the hardest constraint to take into account is the computation time. We have combined different optimisation techniques: the gradient is computed only once per frame and only on the area of interest, costly processing steps are not performed (global filtering or preprocessing, object motion estimation, camera motion compensation), and the images are analysed through a multiresolution framework. We would now like to introduce more robust colour or texture energies into our model. We also plan to implement the proposed algorithm on a multiprocessor workstation in order to further reduce the required computation time.
References
1. Kass, M., Witkins, A., Terzopoulos, D.: Snakes: Active contour models. International Journal of Computer Vision 1 (1988) 321–331
2. Paragios, N.: Geodesic Active Regions and Level Set Methods: Contributions and Applications in Artificial Vision. PhD dissertation, Université de Nice Sophia-Antipolis (2000)
3. Sethian, J.: Level Set Methods and Fast Marching Methods. Cambridge Univ. Press (1999)
4. Davison, N., Eviatar, H., Somorjai, R.: Snakes simplified. Pattern Recognition 33 (2000) 1651–1664
5. Denzler, J., Niemann, H.: Evaluating the performance of active contour models for real-time object tracking. In: Asian Conference on Computer Vision, Singapore (1995) 341–345
6. Williams, D., Shah, M.: A fast algorithm for active contours and curvature estimation. Computer Vision, Graphics and Image Processing: Image Understanding 55 (1992) 14–26
7. Lefèvre, S., Mercier, L., Tiberghien, V., Vincent, N.: Multiresolution color image segmentation applied to background extraction in outdoor images. In: IS&T European Conference on Color in Graphics, Image and Vision, Poitiers, France (2002) 363–367
8. Cheung, K., Yeung, D., Chin, R.: On deformable models for visual pattern recognition. Pattern Recognition 35 (2002) 1507–1526
9. McInerney, T., Terzopoulos, D.: T-snakes: Topology adaptive snakes. Medical Image Analysis 4 (2000) 73–91
10. Velasco, F., Marroquin, J.: Growing snakes: Active contours for complex topologies. Pattern Recognition 36 (2003) 475–482
11. Ji, L., Yan, Y.: Loop-free snakes for highly irregular object shapes. Pattern Recognition Letters 23 (2002) 579–591
12. Choi, W., Lam, K., Siu, W.: An adaptive active contour model for highly irregular boundaries. Pattern Recognition 34 (2001) 323–331
13. Perera, A., Tsai, C., Flatland, R., Stewart, C.: Maintaining valid topology with active contours: Theory and application. In: CVPR, USA (2000) 496–502
14. Delingette, H., Montagnat, J.: Shape and topology constraints on parametric active contours. Computer Vision and Image Understanding 83 (2001) 140–171
15. Ray, N., Chanda, B., Das, J.: A fast and flexible multiresolution snake with a definite termination criterion. Pattern Recognition 34 (2001) 1483–1490
An Object Tracking Algorithm Combining Different Cost Functions
D. Conte1, P. Foggia2, C. Guidobaldi2, A. Limongiello1, and M. Vento1
1 Dip. di Ingegneria dell'Informazione ed Ingegneria Elettrica, Università di Salerno, Via Ponte don Melillo, I84084 Fisciano (SA), Italy
2 Dip. di Informatica e Sistemistica, Università di Napoli, Via Claudio 21, I80125 Napoli, Italy
Abstract. This paper presents a method for tracking moving objects in video sequences. The tracking algorithm is based on a graph representation of the problem, where the solution is found by minimizing a matching cost function. Several cost functions have been proposed in the literature, but it is experimentally shown that none of them, when used alone, is sufficiently robust to cope with the variety of situations that may occur in real applications. We propose an approach based on the combination of cost functions, showing that it enables our system to overcome the critical situations in which a single function shows its weakness, especially when the frame rate becomes low. Experimental results, presented for video sequences obtained from a traffic monitoring application, confirm the performance improvement of our approach.
1 Introduction During the last decade, the Computer Vision community has shown an increasing interest in object tracking, applied to contexts like video surveillance, traffic monitoring, and animal behavior observation. For these applications, a video-based tracking system would have the significant advantage of a relatively simple hardware set-up on the field (one or more properly placed cameras), while alternative technologies would involve a more invasive sensor placement. On the other hand, only in recent years has the computing power needed for real-time video processing become sufficiently available and affordable for dealing with this kind of application. The task of a vision-based tracking system can be coarsely split into three subtasks: object detection, devoted to detecting and segmenting the moving objects from the background looking at a single frame; object tracking, whose aim is to preserve the identity of an object across a sequence of frames, following the movements and the changes in appearance (due for example to a change of orientation or posture) of the object itself; and application event detection, which uses the results of object tracking to recognize the events that must be handled by the application. In this paper we will focus our attention only on the first two subtasks, since the third one is application dependent.
In the literature, there are three conventional approaches to automated moving object detection: temporal differencing approaches [1], background subtraction approaches [7], and optical flow approaches [4]. In background subtraction techniques, the difference is computed between the current frame and a representation of the scene without any moving objects (the reference image); a pixel is considered as belonging to the foreground if the difference is greater than a threshold. A very popular enhancement of this approach is adaptive background subtraction [7], where the reference image is continuously updated to follow environmental dynamics. For the tracking layer, the approaches proposed in the literature can be divided into four main categories: region-based, model-based, contour-based and feature-based. Region-based methods consider connected components of the foreground, and try to establish a correspondence between the regions in adjacent frames [3]. This correspondence is based on some similarity measure, usually involving information such as position, size, overlap, intensity or color, or correlation; sometimes a predictive model of these values is obtained by means of Kalman filters or similar prediction techniques. In model-based methods, the algorithm starts with a model of the object to be tracked, and the tracking consists in looking for instances of the model within the image [9]. In the contour-based approach the objects are represented by their contour, whose motion can be described using Kalman models [5] or Markov Random Fields [8]. The last family of tracking methods is the feature-based approach: instead of whole objects, the tracking algorithms work on distinguishable points or lines, called features [6]. The motion of the features is followed by means of feature tracking algorithms, and a feature grouping operation is then needed to construct object identities. This paper addresses the definition and the performance assessment of a tracking method. Our tracking layer follows the region-based approach, and is laid on top of an adaptive background subtraction algorithm. We have factored the tracking problem into two subproblems: the first is the definition of a suitable measure of similarity between regions in adjacent frames; provided with this measure, the second subproblem is the search for an optimal matching between the regions appearing in the frames. As regards the first subproblem (the definition of a similarity measure), we propose several different metrics, jointly used during the detection phase, according to a sort of signal fusion approach. It is well known that under ideal conditions (high frame rate, well-isolated objects) the measures based on position or overlap show a very good performance, but ambiguities generally arise when multiple objects are close to each other, especially if the frame rate becomes low. The solution we propose is based on the use of a Multi-Expert approach, in which the position information is combined with a different measure to overcome its limitations. We have developed a method to optimize such a combination in order to fit the requirements of a specific application, starting from a set of training data. We have also made a comparative benchmarking of several different metrics, in order to obtain some experimental evidence about their relative effectiveness in the application domains considered.
The subproblem of the optimal matching has been formulated in a graph-theoretic framework, and has been reduced to a Weighted Bipartite Graph Matching, for which
a standard algorithm has been used. The main advantage of this approach is the ability of the algorithm, in case of ambiguities, to make a choice guided by a global criterion. The paper is organized as follows: Sect.2 presents the first layer of the system we have developed. Sect.3 is devoted to our object tracking layer, describing both the matching algorithm and the considered metrics. Sect.4 describes our experimental framework, together with a presentation and discussion of the results obtained in the application domain of traffic monitoring. Conclusions are drawn in Sect.5.
2 Moving Object Detection The first step in an object tracking system is the detection of the foreground (moving) objects in the image. For this task we have developed an algorithm based on adaptive background subtraction. The key aspects of an adaptive background subtraction method are the choice of the threshold for comparing the current frame with the reference image, and the policy used for updating the reference image over time; in the following we describe our solutions to these problems. As regards the choice of the threshold, our algorithm differs from the basic approach by the introduction of a dynamic strategy to update the threshold in order to adapt to the dynamics of background changes. In detail, the overall intensity of the current frame is compared with a moving average computed over the recent frames: if the current intensity is significantly different from the average, the threshold is increased; otherwise, it is decreased. Let us turn our attention to the reference image update. The simplest way to perform this task is a linear combination between the reference image and the current frame, R_new = (1 - a)·R_old + a·F, where the parameter a is chosen according to the desired updating speed. However, if the update is too slow, the system will not be able to deal with scene or illumination changes; on the other hand, if the update is too fast, an object that moves slowly or stands still for a short time will be unduly included in the reference image. For this reason we have devised an improved updating rule, which updates very slowly the regions of the reference image that correspond to foreground pixels of the current frame, and quickly those that correspond to background pixels. In this way the algorithm is able to promptly follow changes in the lighting of the scene. In order to follow the trajectories of the moving objects in the video sequence, the identification of the objects within the foreground region is also required. To this aim, we use a standard algorithm [2] for detecting the connected components in a binary image (1 for foreground pixels, 0 otherwise). Each connected foreground component is considered a detected object and is described by means of its bounding box. Especially in outdoor environments, it is likely that the shadow cast by an object is also considered by the detection algorithm as part of the object. This can cause two kinds of problems: first, the size and shape of the object are not reconstructed correctly; second, the shadow may touch a nearby object, causing the algorithm to consider the two objects as one. For this reason, we have introduced an algorithm to remove the shadows, based on the histogram of the foreground pixels. After shadow removal, the connected component analysis correctly detects the two moving objects.
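A minimal sketch of this detection layer (dynamic threshold plus selective reference-image update), assuming grey-level frames as numpy arrays; the update rates, the threshold adaptation step and the intensity-change test are illustrative values, not the paper's.

import numpy as np

class BackgroundSubtractor:
    def __init__(self, first_frame, threshold=25.0,
                 alpha_bg=0.05, alpha_fg=0.001):
        self.reference = first_frame.astype(float)
        self.threshold = threshold
        self.alpha_bg = alpha_bg      # fast update for background pixels
        self.alpha_fg = alpha_fg      # very slow update for foreground pixels
        self.mean_intensity = first_frame.mean()

    def apply(self, frame):
        frame = frame.astype(float)
        # Dynamic threshold: react to global intensity changes.
        if abs(frame.mean() - self.mean_intensity) > 10:
            self.threshold += 1
        else:
            self.threshold = max(5, self.threshold - 1)
        self.mean_intensity = 0.9 * self.mean_intensity + 0.1 * frame.mean()

        foreground = np.abs(frame - self.reference) > self.threshold

        # Selective update: background regions follow the scene quickly,
        # foreground regions are absorbed only very slowly.
        alpha = np.where(foreground, self.alpha_fg, self.alpha_bg)
        self.reference = (1 - alpha) * self.reference + alpha * frame
        return foreground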
3 The Tracking Algorithm and the Metrics Generally speaking, tracking objects in a video sequence can be described as follows. If two frames are consecutive, it is highly probable that the objects identified in the first frame have a correspondence with the objects identified in the second frame; tracking is the identification of this correspondence. Corresponding objects can be more or less translated, and the amount of translation depends both on the object speed and on the frame rate. Moreover, we can also have objects appearing in only one of the two frames, because they are entering or leaving the scene. Formally, a tracking algorithm can be described in the following way. A number of objects is identified in each frame and a different label is associated to each object. Let B(t) be the set of bounding boxes belonging to frame t, and let L be a set of labels. It is then possible to build a mapping between the bounding boxes of frame t-1 and a subset of L such that labels are uniquely assigned to the objects of B(t-1). Let us now consider two consecutive frames of a video sequence, frame t-1 and frame t. The solution of the tracking problem between two successive frames is an injective mapping between the sets B(t-1) and B(t). In particular we want to determine the mapping such that labels are uniquely assigned to the objects of B(t) and such that, if an object of B(t) corresponds to an object of B(t-1), it receives the same label. If a new object appears in the field of view of the camera, a label never used before is assigned to it, and its box is called a new box. Moreover, if an object disappears from the field of view, it has no correspondent in B(t); as a consequence it is not considered anymore and its box is discarded. The problem at hand can be represented by a matrix whose rows and columns respectively represent the objects of B(t-1) and the objects of B(t) (correspondence matrix). A solution of the tracking problem can then be simply described: if the element (i, j) of the matrix is 1, the label assigned to the j-th box of B(t) is the same as the label of the i-th box of B(t-1). Since there are no duplicate labels, each row and each column contains the value 1 at most once. Computation of the mapping. The object tracking problem can be solved by computing a suitable injective mapping, i.e. by solving a Weighted Bipartite Graph Matching (WBGM) problem. A Bipartite Graph (BG) is a graph whose nodes can be divided into two sets such that no edge connects nodes in the same set. In our problem, the first set is B(t-1) while the second set is B(t). Before the correspondence is determined, each box of B(t-1) is connected with each box of B(t), thus obtaining a complete BG. Each box is uniquely identified within its set by its label. An assignment between the two sets is any set of ordered pairs whose first element belongs to B(t-1) and whose second element belongs to B(t), with the constraint that each node appears at most once in the set. A maximal assignment, i.e. an assignment containing a maximal number of ordered pairs, is known as a matching (BGM). Each edge of the complete bipartite graph is labeled with a cost. This cost function takes into account how similar the two boxes are: the lower the cost, the more suitable the edge. If the cost of an edge is higher than a fixed threshold, it is considered unprofitable and raised to infinity so that it cannot be included in an
assignment. The value of the threshold can be evaluated by maximizing the tracker performance over a training set of video sequences. The cost of an assignment is the sum of the costs of all the edges included in the matching. Given two assignments, the first is preferable to the second if it contains more ordered pairs and its cost is lower than the cost of the second. The problem of computing a matching of minimum cost is called Weighted BGM. Fig.1 illustrates the solution of a WBGM problem, and a small sketch of this matching step is given below the figure.
Fig. 1. A solution of a WBGM problem. The mapping has been determined; label equality holds for those boxes representing the same object in the two frames, and new boxes receive new labels in the example.
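The minimum-cost matching itself can be computed with any standard assignment solver. The sketch below is not the authors' implementation; it uses scipy.optimize.linear_sum_assignment and treats thresholded (infinite-cost) edges as forbidden.

import numpy as np
from scipy.optimize import linear_sum_assignment

def match_boxes(cost, threshold):
    """cost[i, j]: matching cost between box i of frame t-1 and box j of
    frame t. Returns the accepted (i, j) pairs; rows left unmatched are
    treated as disappeared objects and columns as new boxes."""
    cost = cost.copy()
    big = 1e9                           # finite stand-in for an infinite cost
    cost[cost > threshold] = big
    rows, cols = linear_sum_assignment(cost)
    return [(i, j) for i, j in zip(rows, cols) if cost[i, j] < big]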
Cost Functions. Many cost functions suitable for the object-tracking problem have been proposed. It is possible to identify at least three categories: position, shape and visual cost functions. The position cost functions are easy to compute but are adequate only if the frame rate is sufficiently high or if the object motion is slow. If the objects move too fast, the position of an object is no longer a reliable cue, because the box representing an object can shift enough to be confused with a different object. To this category we can ascribe the following two cost functions: overlap and distance. - Overlap. The cost element (i, j) is based on the overlap between the two boxes: if they have the same area and the same position, the value of the cost element is 1; if there is no overlap, the cost element is infinite. - Distance. The cost element (i, j) is the Euclidean distance between the centers of the two boxes. If the distance is greater than a threshold, the cost is infinite. Shape cost functions consider the similarity of the objects, independently of their location in the frame. These cost functions remain reliable also for fast-moving boxes and slow frame rates. Their main drawbacks are that in many circumstances the shape of an object is not stable, and that a scene can contain many objects with a very similar shape. A simple cost function of this kind can be defined as follows: - Dimension. The height h and the width w of each box are considered. The cost element (i, j) is the Euclidean distance between the (h, w) vectors of the two boxes. If the distance is greater than a threshold, the cost is infinite. Visual cost functions consider two boxes similar if they look close from the perceptual point of view. They are position and shape independent, and thus useful when the frame rate is low or the objects move fast. The main drawback is that their computational complexity is higher than that of the other categories; furthermore, their discriminant ability decreases dramatically when the illumination or the sensitivity of the camera is low, and they are also inadequate in contexts where several similar objects are present simultaneously. We report some of the cost functions of this category: - Gray Level. The average of the red, green and blue components of each pixel is computed, the histogram of this brightness is then built for each box, and the correlation between the two histograms is obtained. The cost function is computed from this correlation.
- Color. Consider the histogram of each color channel c. We compute the correlation index as follows.
The overall index is the average of three such indexes, computed on the histograms of red, green and blue. Its value is 1 if the colors of the two boxes are identical for each pixel, and 0 if the colors of the boxes are completely different; the cost function is derived from this index, l being the number of levels of the histogram. Combination of Cost Functions. In real applications, the tracking problem is often too complicated to be solved using only one cost function, and it is reasonable to assume that improvements can be achieved by using a suitable combination of them. In particular, a cost function based on the evaluation of position is effective in many cases, except when the frame rate is low or the object moves quickly. So, our idea is to combine this commonly used cost function with others chosen among the shape and visual functions. The combined cost has been computed as the weighted sum of two cost functions by means of a parameter between 0 and 1. In our experiments, the first is always the distance cost function; the second has been chosen among the cost functions of the other categories: dimension, gray level, and color.
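A sketch of the distance, dimension and gray-level cost functions and of their weighted combination, assuming boxes given as (x, y, w, h) tuples over 8-bit images. The thresholds, the histogram size, and the "1 - correlation" form of the gray-level cost are illustrative assumptions, not values taken from the paper.

import numpy as np

def distance_cost(b1, b2, max_dist=80.0):
    c1 = np.array([b1[0] + b1[2] / 2, b1[1] + b1[3] / 2])
    c2 = np.array([b2[0] + b2[2] / 2, b2[1] + b2[3] / 2])
    d = np.linalg.norm(c1 - c2)
    return d if d <= max_dist else np.inf

def dimension_cost(b1, b2, max_diff=40.0):
    d = np.linalg.norm(np.array(b1[2:]) - np.array(b2[2:]))
    return d if d <= max_diff else np.inf

def gray_level_cost(patch1, patch2, levels=64):
    """Assumed form: 1 - correlation of the brightness histograms of the
    two box contents (patches are H x W x 3 arrays)."""
    h1, _ = np.histogram(patch1.mean(axis=2), bins=levels, range=(0, 255), density=True)
    h2, _ = np.histogram(patch2.mean(axis=2), bins=levels, range=(0, 255), density=True)
    corr = np.corrcoef(h1, h2)[0, 1]
    return 1.0 - corr

def combined_cost(c1, c2, alpha):
    """Weighted combination of two (normalised) cost values, alpha in [0, 1]."""
    return alpha * c1 + (1.0 - alpha) * c2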
4 Experimental Results Our experiments have been performed using two video sequences from a traffic monitoring application, each sampled at two different frame rates. Each of the four obtained frame sequences has been divided into a training set (TRS) and a test set (TS); Tab.1 presents summary information about the data. It can be noticed that the frame rates are quite low, reflecting the fact that the system is intended to work on inexpensive hardware, where the computational load of the object detection phase is a limiting factor on the attained speed. We can expect that a traditional approach based only on the distance cost function will suffer from the problems outlined in Sect. 3.
We have performed our experiments separately on each video sequence, in order to measure the performance of the proposed approach in a well characterized context. Furthermore, in order to appreciate the robustness of the method we have performed
a further experiment using the concatenation of all the sequences as a single video; in this way we can check whether a single set of parameters can still deliver a reasonable performance when the environment conditions change. In order to provide a quantitative measure of the tracker performance, we adopted the following classification of the tracker outcomes: a true positive (TP) is an association found by the tracker that is also present in the ground truth; a true negative (TN) is an object labeled as new by both the tracker and the ground truth; a false positive (FP) is an association found by the tracker that is missing in the ground truth; a false negative (FN) is an object labeled as new by the tracker, but not new in the ground truth. The goal of the tracker is to maximize true positives and true negatives, while minimizing false positives and false negatives. In our experimentation, we have considered these two classes of errors (FP+FN) equivalent; so we have used as our evaluation criterion the single performance index defined as: P = (TP+TN)/(TP+TN+FP+FN). The first step in our experimentation has been the determination of the optimal threshold for each of the four considered cost functions. For this purpose, we have applied our tracking system on each training set, using only one of the considered cost functions, varying the value of the threshold from 0 to the maximum of the cost function, and choosing the threshold maximizing P. Tab.2a reports the results of this experiment.
The second phase of the experimentation has been the determination of the optimal value of the combination parameter mentioned in Sect.3. First, we normalized the values of each cost function, scaling them by a factor ensuring that they have the same order of magnitude over the TRS. Then, for each combination investigated (distance + color, distance + gray level, distance + dimension) we searched for the optimal value of the parameter by evaluating the performance index over the training set while varying it from 0 to 1 with a step of 0.01. Results are presented in Tab.2b. Once the parameters for each combination have been fixed, the obtained combinations have been validated on the TS, to measure the improvement with respect to the cost functions taken separately. Results are presented in Fig.2. It can be seen that on the TS there are a few cases where some of the combinations on a single video sequence perform worse than the distance cost function alone. This happens mostly for the distance + dimension combination, and can be explained by the problems of the shape cost functions outlined in Sect.3. However, if we consider the concatenation of the video sequences (sequence 5), all three combinations outperform the distance, two of them by a significant margin. In particular, the distance + gray level proves to be
the best, attaining a 10% improvement (reaching a performance of 0.92, against 0.82 for the gray level alone and 0.80 for the distance alone), followed by distance + color. This result is all the more remarkable because the parameters used for the tracking have not been optimized separately on each video sequence, but were obtained on a global TRS. Thus, the improvement attained by the proposed method is sufficiently general to be exploited in contexts where the conditions of the scene are not uniform over time.
Fig. 2. The performance on the test set, for the combinations: distance + color (a); distance + gray level (b); distance + dimension (c)
5 Conclusions In this paper we discussed an object tracking method based on a graph-theoretic approach, depending on the definition of a suitable cost function. We demonstrated that by using a simple combination of two different cost functions, it is possible to improve the results with respect to any single cost function. A future development of our proposed method will involve the adoption of a more refined combining scheme, in which the weights of the cost functions being combined will not be fixed, but will be adapted dynamically to the current conditions of the scene.
References
1. Anderson, C., Burt, P., van der Wal, G.: Change detection and tracking using pyramid transformation techniques. In: Proc. SPIE Intell. Rob. and Comp. Vis., Vol. 579 (1985) 7278
2. Ballard, D.H., Brown, C.: Computer Vision. Prentice-Hall (1982)
3. Collins, R.T., Lipton, A.J., Fujiyoshi, H., Kanade, T.: Algorithms for cooperative multisensor surveillance. Proceedings of the IEEE, Vol. 89-10 (2001) 1456-1477
4. Halevy, G., Weinshall, D.: Motion of disturbances: Detection and tracking of multi-body nonrigid motion. Mach. Vis. Applicat., Vol. 11-3 (1999) 122–137
5. Peterfreund, N.: Robust tracking of position and velocity with Kalman snakes. IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 21-6 (1999) 564-569
6. Richardson, H.S., Blostein, S.D.: A sequential detection framework for feature tracking within computational constraints. Proceedings of the IEEE Conf. on CVPR (1992) 861-864
7. Stauffer, C., Grimson, W.E.L.: Learning patterns of activity using real-time tracking. IEEE Trans. Pattern Anal. Mach. Intell., Vol. 22 (2000) 747-757
8. Xu, D., Hwang, J.N.: A Topology Independent Active Contour Tracking. Proc. of the 1999 IEEE Workshop on Neural Networks for Signal Processing (1999)
9. Wren, C.R., Azarbayejani, A., Darrell, T., Pentland, A.P.: Pfinder: real-time tracking of the human body. IEEE Transactions on PAMI, Vol. 19-7 (1997) 780-785
Vehicle Tracking at Traffic Scene with Modified RLS
Hadi Sadoghi Yazdi1, Mahmood Fathy2, and A. Mojtaba Lotfizad1
1 Department of Electrical Engineering, Tarbiat Modares University, Tehran, Iran, PoBox: 14115-143
[email protected], [email protected]
2 College of Computer Engineering, Science and Technology, Tehran, Iran
[email protected]
Abstract. Multi-object tracking algorithms are based on prediction. One of the most commonly used prediction algorithms is the RLS algorithm, which has many applications because of its good convergence rate. However, the RLS algorithm tracks inexact and noisy measurements just as it tracks the signal. In this work, through an appropriate combination of the RLS and MAP algorithms, an RLS algorithm with filtered input is presented. In this algorithm the MAP estimation is used as an input filter to the RLS algorithm to mitigate the noise effect. In order to determine the mean of the noise in the MAP algorithm, we use a recursive method based on the RLS error. It can be proved that the mean square error of the proposed algorithm, which we call Modified RLS (MRLS), is reduced at least to the level of the conventional RLS algorithm. The method is tested in two different areas: the prediction of a noisy sinusoidal chirp signal and multiple-object tracking of vehicles in a traffic scene.
1 Introduction A lot of research has been done on intelligent transportation systems; one of its outcomes is the surveillance of road traffic based on machine vision techniques. Although traffic control is based on the global traffic flow, checking local data, such as the individual behavior of vehicles, has many applications: determining normal behavior and identifying offending drivers are among them. The behavior of any vehicle can be analysed using the obtained vehicle trajectory and quantitative motion parameters like velocity and acceleration, from which normal or abnormal behavior can be identified. The vehicle trajectory is an important feature for behavior recognition, so many research works have studied vehicle tracking despite difficulties such as full or partial occlusion [2, 3, 19, 20, 14, 15, 17]. Tracking moving objects is performed by predicting the next position coordinates or features. One of the prediction tools is the adaptive filter; the Kalman filter and the RLS algorithm belong to this family. The Kalman filter is an optimum filter which
is model based and minimizes the variance of the estimation error [1-2]. In practice, a motion model of the objects inside the scene does not exist, and either a fixed or a variable model must be assumed. The Kalman filter's need for a model is one of the problems of using this filter [3]. Another approach to tracking is to use LMS and RLS adaptive filters. In noisy environments where the signal-to-noise ratio is low, the LMS filter has better tracking ability than the RLS filter, but the fast convergence of the RLS has led to new developments of this algorithm which enhance its capability and performance in non-stationary environments [6-7]. Using the properties of the Kalman filter, which is an optimum linear tracker, to improve the RLS filter has led to better results [4]. Numerous papers have also used MAP and ML as filters; we mention some of them. In one paper, the presence of noisy images in the detection and localisation of moving objects motivated the use of MAP as an estimator of the measurement data [5]. In reference [9], an estimation algorithm is proposed for computing the motion vectors of the image pixels while mitigating the noise effect. The performance of the MSE criterion in the BLUE estimator with an ML fitness function is improved in reference [10]. In our previous paper [12], through a suitable combination of the RLS and MAP algorithms, an RLS algorithm with filtered input was presented. In the present work the mean square error of the proposed algorithm (the combination of RLS and MAP) is derived, and it can be proved that the power of this error is less than that of the RLS algorithm. The method is also applied to tracking vehicles in the traffic scene. In section 2 of the paper, we present the RLS and MAP algorithms; in section 3, the proposed Modified RLS algorithm is presented together with an algorithm for adjusting the MAP parameters, and its error is derived analytically. In section 4, this algorithm is used for vehicle tracking in the traffic scene. The results are summarised in the conclusion section.
2 The Tracking Algorithm and the Estimator In this section, we review the RLS and MAP algorithms. In this problem, the input is considered to be the original signal contaminated with additive noise:
x(n) = y(n) + v(n)    (1)
where y(n) is the original signal, v(n) is the additive noise, and x(n) is the noisy observation signal.
2.1 The RLS Algorithm This algorithm is used for transfer function identification, noise cancellation, inverse system modelling, and prediction. The purpose is to minimize the sum of the squared errors in the time domain, where e(n) is the error at the n-th sample and (x(n), x(n-1), ..., x(n-N)) is the buffered input signal of length N+1.
Here e(n) is the error signal and the h_i are the RLS filter coefficients. The goal of the RLS algorithm is to minimize V(n) (relation (3)), where V(n) is the sum of the squared errors over the time interval; in practice, it minimizes the time average of the squared error. At each time step, V(n) is minimized and the filter weights are changed accordingly.
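A minimal RLS predictor, not taken from the paper, that predicts the next sample from the last N samples; the forgetting factor and the initialisation are conventional textbook choices.

import numpy as np

class RLSPredictor:
    def __init__(self, n_taps, lam=0.99, delta=100.0):
        self.n = n_taps
        self.lam = lam                      # forgetting factor
        self.h = np.zeros(n_taps)           # filter weights
        self.P = np.eye(n_taps) * delta     # inverse correlation matrix

    def update(self, u, d):
        """u: last n_taps inputs (most recent first); d: desired sample.
        Returns the prediction error and updates the weights."""
        u = np.asarray(u, dtype=float)
        y = self.h @ u
        e = d - y
        k = self.P @ u / (self.lam + u @ self.P @ u)   # gain vector
        self.h += k * e
        self.P = (self.P - np.outer(k, u @ self.P)) / self.lam
        return e

    def predict(self, u):
        return self.h @ np.asarray(u, dtype=float)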
2.2 The MAP Algorithm The Bayes estimator addresses a statistical inference problem and includes the classic estimators such as MAP, ML and MMSE. Bayesian estimation is based on minimizing the Bayes risk function, which includes an a posteriori model of the unknown parameters and an error function. The MAP estimator, given the noise PDF and the observation signal, seeks to maximize the posterior likelihood of the original signal given the observation [7]. The a posteriori probability of y(n) (the original signal) from (1) is as follows:
In this problem the noise is assumed to be Gaussian; the signal variance, noise variance, signal mean and noise mean are denoted var_y, var_v, mu_y and mu_v respectively. To obtain the MAP estimate, the derivative of the logarithm of the likelihood with respect to the original signal y(n) is taken and set to zero. The resulting estimate is
y_hat(n) = (var_v / (var_y + var_v)) * mu_y + (var_y / (var_y + var_v)) * (x(n) - mu_v)    (6)
i.e. the estimated signal y_hat(n) is obtained as a combination, or linear interpolation, of two weighted terms: the unconditional mean of the original signal y(n) (i.e. mu_y) and the difference between the observation signal and the noise mean, x(n) - mu_v.
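A short sketch of this Gaussian MAP estimator; the shrinkage form follows directly from (6) as reconstructed above, with all quantities being the means and variances just defined.

def map_estimate(x, mu_y, var_y, mu_v, var_v):
    """Gaussian MAP estimate of the original sample y given observation x,
    signal statistics (mu_y, var_y) and noise statistics (mu_v, var_v)."""
    w = var_y / (var_y + var_v)          # weight of the (denoised) observation
    return (1.0 - w) * mu_y + w * (x - mu_v)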
3 The Proposed Method MRLS In this section, a method for reducing the noise effect is first presented, and then the noise statistics are computed by a recursive method.
3.1 MRLS Method The RLS algorithm in the prediction configuration tracks the noise present in the signal together with the signal itself and does not have the capability of suitable noise filtering. This is because the filter is data-driven and its output is a linear combination of the input signal. To cancel the noise, algorithms such as Kalman filtering consider a model of the generation process and of the signal observation. The Kalman filter, when used for tracking moving objects, demands a model of the object motion in linear form; if no motion or process model exists, applying this algorithm is difficult. In contrast, the RLS algorithm does not need a model, but this in turn makes the algorithm vulnerable to process and observation noise. Therefore, because the original signal information is not available as the desired target of the filter, this algorithm tracks the signal x(n) instead of y(n). In this paper, the purpose is to track the estimate of the original signal instead of the signal x(n). For this purpose, we use an estimator which does not exploit a process model: the MAP estimator, which, given the statistics of the distributions of the original signal and of the noise, is able to properly estimate the signal. Initially we assume the noise statistics to be constant and known; in the next section they are calculated with a recursive method. The MAP estimator requires the mean and variance of the original signal and of the noise, but during tracking only one new data sample is available at each step, so these statistics cannot be computed from it alone. For this purpose, we predict the M next states with the RLS predictor and save them. Then, at each time step, we apply the input signal (together with the states previously predicted by the RLS filter) to the estimator, Fig. 1. As the block diagram of Fig. 1 shows, the input signal x(n) is merged with the information taken from a 2-dimensional table (containing the predictions made by the RLS filter at the different stages of observing the input signal) and is applied to the MAP estimator as a data vector.
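A sketch of how the MAP input filter and the RLS predictor could be coupled, reusing the hypothetical RLSPredictor and map_estimate helpers sketched above. The buffer handling, the signal statistics taken from the buffer, and the noise-mean recursion are simplifications of the scheme of Fig. 1 and of relation (8), not the paper's exact equations.

import numpy as np

def mrls_step(rls, history, pred_buffer, x, mu_v, var_v, beta=0.95):
    """One MRLS step on a new noisy sample x.
    history: past filtered samples (most recent first), length >= rls.n.
    pred_buffer: the last few RLS predictions of the current sample."""
    # MAP input filtering: signal statistics estimated from the buffered
    # predictions plus the new observation (simplified).
    data = np.append(pred_buffer, x)
    y_hat = map_estimate(x, data.mean(), data.var() + 1e-6, mu_v, var_v)

    # The RLS now tracks the filtered signal instead of the raw input.
    e = rls.update(history[:rls.n], y_hat)

    # Recursive noise-mean update driven by the RLS error (stand-in for (8)).
    mu_v = beta * mu_v + (1.0 - beta) * e

    history = np.append(y_hat, history)[:rls.n + 1]
    next_pred = rls.predict(history[:rls.n])
    pred_buffer = np.append(pred_buffer[1:], next_pred)
    return y_hat, mu_v, history, pred_buffer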
3.2 The Error in the MRLS Algorithm At step m the data buffer contains k samples: k-1 predictions made in steps where the RLS had converged, plus the received data sample. From this data the MAP estimator gives an estimate according to (6). The population in the buffer gradually becomes more uniform, i.e. the predictions made at different steps tend to a single value. Hence the standard deviation of the noise in the data is reduced; under this assumption the expected value of the squared error is found and (6) is converted to (7).
Fig. 1. The MRLS algorithm
The noise statistics are obtained from the proposed recursive relation (8), in which the noise mean at the kth step is updated using a convergence coefficient lying between 0 and 1. The MRLS error is then obtained from (9); since in the MRLS algorithm the input to the RLS is taken from the MAP estimator output, relation (8) is used when computing the RLS output error in (9). In (9), the RLS weight vector appears together with N, the number of adaptive filter weights, and with the average of the noise over the last N time instants, which is an estimate of the true noise mean. Writing this estimate as the true noise mean plus an approximation error, (9) can be written as (10). Substituting (10) in (8) yields (11), in which the conventional RLS error appears. With a good approximation, (11) can be written as (12). Substituting the recursion into itself repeatedly, we obtain (13). Since the convergence coefficient is assumed to be less than one and k is large, the first term of (13) can be ignored, and the second term is a partial Newton polynomial which simplifies to (14).
From (14) the following results can be obtained. (A) In the limit, the estimated noise mean tends to the estimate of the system error. (B) Multiplying both sides of (14) by the RLS error and taking the expected value, the first term is always less than 1 and the remaining factor is the expected value of the squared error of the conventional RLS; hence we obtain (16). (C) With regard to (14) we can deduce an inequality; taking the expected value of this inequality yields (17). With regard to the obtained results, the expected value of the squared error of the MRLS algorithm is calculated as in (18).
Using result (C), i.e. (17), in (18) yields (19); using result (B), i.e. (16), we then obtain (20). Relation (20) shows that the expected value of the squared error of the MRLS algorithm is less than the expected value of the squared error of the RLS by a non-negative amount. Also, regarding result (A), the expected value of the noise mean tends towards the estimate of the system error; therefore the direction of motion of the noise mean is correct, i.e. it is in line with the error computation.
4 Applying MRLS to Vehicle Tracking in the Traffic Scene In this section, we first show the performance of the MRLS algorithm and its superiority in a simulation on a chirp signal contaminated with additive noise of nonzero mean. Then we use it as a tracker for vehicles in the traffic scene.
4.1 Test of MRLS on a Noisy Chirp Signal This section is devoted to the prediction of the original signal from a noisy chirp signal (Fig.2) using RLS and MRLS. In Fig.2 it can be observed that the RLS algorithm tracks the noise, while the MRLS algorithm tracks the original signal. In the presented algorithm the noise mean is calculated using relation (8). Over repeated experiments (100 runs), the second norm of the error of the MRLS algorithm is reduced by 20% in comparison with the conventional RLS. The evolution of the noise mean calculated from (8) is given in Fig.3; it can be seen that, as new samples are observed, the noise mean approaches its true value. In the proposed algorithm the noise is assumed to be Gaussian for the estimation of the original signal; in practice, the noise distribution has to be determined from the problem conditions and the relations for estimating the original signal must be rewritten accordingly.
Fig. 2. The top picture shows the signal tracking using the RLS algorithm and the bottom picture shows the signal tracking using the MRLS algorithm
Fig. 3. Changes of the noise mean obtained using relation (8)
If the initial noise mean is chosen very far from reality, more iterations are needed for convergence and estimation of the noise mean; this issue is also depicted in Fig.3.
4.2 Application of MRLS to Position Prediction in Vehicle Tracking in the Traffic Scene Tracking vehicles on roads plays a notable role in the analysis of the traffic scene. Generally, in vehicle tracking, feature points or a model are tracked through the consecutive frames; in other words, vehicles are detected initially [8] and then followed through the consecutive frames. A trajectory predictor is used to increase the tracking precision, to reduce the size of the search area for the expected location in the image, and to avoid losing a vehicle because of similar objects around it. In feature point tracking [13,14], special points related to the object are found; in region tracking, the moving blobs which do not belong to the background are investigated [15]. In model-based tracking [16, 17], a 2D or 3D model of the moving object is obtained and searched for in the next frames. The present application tracks coloured moving blobs. The algorithm applied for tracking multiple objects inside the scene is based on prediction, which resolves the problem of tracking nearby vehicles [18]. After the detection of the vehicles, similar blobs in two consecutive frames which are in close spatial positions are found and the most similar blobs are attributed to each other. These locations are fed to an MRLS predictor so that, after convergence, the prediction for each blob helps in attributing similar blobs; the MRLS predictor corrects improper attributions of blobs due to their similarity. In this manner, after each vehicle arrives in the scene, it is labelled and tracked in the region of interest inside the scene. The positions of the centres of gravity of the two similar blobs obtained in two frames are given to an MRLS predictor to progressively predict the next position. The trajectories tracked by the RLS algorithm and by MRLS are shown in Fig.4. It can be seen that the path obtained with the proposed filter (in red) is smoother than the path predicted by the RLS algorithm. For a better comparison of the two algorithms, Fig.5 shows the predicted row coordinates of the path of one of the cars with the two algorithms; a simple sketch of the blob attribution step is given below, after the figures.
Fig. 4. The predicted trajectory by RLS (white path) and by MRLS (red path)
Fig. 5. Comparison of the predicted rows of the path of a car with two methods, Blue: RLS, Red: MRLS
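A sketch of the attribution scheme of Sect. 4.2: each tracked vehicle keeps a predictor of its centre of gravity, and each new blob is attributed to the track whose predicted position is closest. The gating distance and this nearest-neighbour form are assumptions for illustration, not the paper's exact rule.

import numpy as np

def attribute_blobs(tracks, blob_centres, max_dist=30.0):
    """tracks: dict label -> predicted (row, col) centre for the current frame.
    blob_centres: list of detected blob centres. Returns label -> blob index;
    unmatched blobs are treated as newly arrived vehicles."""
    assignment, used = {}, set()
    for label, pred in tracks.items():
        dists = [np.linalg.norm(np.array(pred) - np.array(c))
                 if i not in used else np.inf
                 for i, c in enumerate(blob_centres)]
        best = int(np.argmin(dists)) if dists else -1
        if best >= 0 and dists[best] <= max_dist:
            assignment[label] = best
            used.add(best)
    return assignment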
5 Conclusions In this paper, an MRLS algorithm combining the MAP and RLS algorithms and inheriting the performance of both was presented. It was proved that its error is reduced at least by an amount related to the mean of the noise power. In tracking a chirp signal contaminated with noise, an error reduction of 20% was obtained, and in tracking vehicles a smoother trajectory was predicted using this algorithm. In practical problems, the initial estimate of the noise mean is important for fast convergence of the algorithm.
References
[1] S. Haykin: Adaptive Filter Theory, 3rd ed., Prentice Hall, 1996.
[2] S. Gil, R. Milanese, T. Pun: "Comparing Features for Target Tracking in Traffic Scenes," Pattern Recognition, Vol. 29, No. 8, pp. 1285-1296, 1996.
[3] L. Zhao, C. Thorpe: "Qualitative and Quantitative Car Tracking from a Range Image Sequence," Proc. CVPR, Santa Barbara, CA, June 23-25, pp. 496-501, 1998.
[4] S. Haykin, A.H. Sayed, J. Zeidler, P. Yee, P. Wei: "Tracking of Linear Time-Variant Systems," Proc. MILCOM, pp. 602-606, San Diego, Nov. 1995.
[5] J.W. Lee, I. Kweon: "MAP-Based Probabilistic Reasoning to Vehicle Segmentation," Pattern Recognition, Vol. 31, No. 12, pp. 2017-2026, 1998.
[6] B. Widrow, S.D. Stearns: Adaptive Signal Processing, Prentice Hall, 1985.
[7] S. Vaseghi: Advanced Signal Processing and Digital Noise Reduction, John Wiley & Sons Ltd, 1996.
[8] P.G. Michalopoulos: "Vehicle Detection Video through Image Processing: The Autoscope System," IEEE Transactions on Vehicular Technology, Vol. 40, No. 1, February 1991.
[9] D-G. Sim, R-H. Park: "Robust Reweighted MAP Motion Estimation," IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 20, No. 4, April 1998.
[10] K.C. Ho: "A Minimum Misadjustment Adaptive FIR Filter," IEEE Transactions on Signal Processing, Vol. 44, No. 3, March 1996.
[11] C.F.N. Cowan, P.M. Grant: Adaptive Filters, Prentice-Hall, 1985.
[12] H. Sadoughi Yazdi, M. Lotfizad: "A new approach for tracking objectives using combination of RLS and MAP algorithms," 11th Iranian Electronic Eng. Conference, Vol. 3, pp. 258-266, 2003.
[13] D. Chetverikov, J. Verestoy: "Feature Point Tracking for Incomplete Trajectories," Digital Image Processing, Vol. 62, pp. 321-338, 1999.
[14] B. Coifman, D. Beymer, P. McLaunhlan, J. Malik: "A Real-Time Computer System for Vehicle Tracking and Traffic Surveillance," Transportation Research Part C 6, pp. 271-288, March 1998.
[15] J. Badenas, J.M. Sanchiz, F. Pla: "Motion-Based Segmentation and Region Tracking in Image Sequences," Pattern Recognition 34, pp. 661-670, 2001.
[16] D. Koller, K. Daniilidis, H.-H. Nagel: "Model-Based Object Tracking in Monocular Image Sequences of Road Traffic Scenes," similar version published in International Journal of Computer Vision 10:3, pp. 257-281, 1993.
[17] M. Haag, H.-H. Nagel: "Tracking of Complex Driving Maneuvers in Traffic Image Sequences," Image and Vision Computing 16, pp. 517-527, 1998.
[18] H. Sadoughi Yazdi, M. Lotfizad, E. Kabir, M. Fathi: "Application of trajectory learning in tracking vehicles in the traffic scene," 9th Iranian Computer Conference, Vol. 1, pp. 180-187, Feb 2004.
[19] S. Mantri, D. Bullock: "Analysis of Feedforward-Backpropagation Neural Networks Used in Vehicle Detection," Transportation Research C, Vol. 3, No. 3, pp. 161-174, 1995.
[20] Y.K. Jung, K.W. Lee, Y.S. Ho: "Content-Based Event Retrieval Using Semantic Scene Interpretation for Automated Traffic Surveillance," IEEE Transactions on Intelligent Transportation Systems, Vol. 2, No. 3, September 2001.
Understanding In-Plane Face Rotations Using Integral Projections Henry Nicponski Eastman Kodak Company
[email protected]
Abstract. Because of the primacy of human subjects in digital images, much work has been done to find and identify them. Initial face detection systems concentrated on frontal, upright faces. Recently, multi-pose detectors have appeared, but suffer performance and speed penalties. Here we study solutions to the problem of making detection invariant to in-plane rotation of faces. Algorithms based on integral projections and block averages estimate face orientation correctly within ±10° in about 95% of cases, and are fast enough to work in near real-time systems.
1
Introduction
The sources of variance in appearance of faces in images include identity, pose, illumination, and expression. Surprisingly, identity contributes less to the change in appearance than do the other factors. This fact – true when using almost any noncognitive measure of appearance similarity – seems counter-intuitive in light of the great facility of human recognition of individual persons, which might seem to imply substantial invariant aspects in the appearance of individuals. Sophisticated mechanisms of eye, retina, and brain underlie this seemingly effortless recognition ability. Artificial face detection systems show excellent invariance to individual identity of frontal faces [13]. We use two algorithms, A and B, trained on frontal faces after the methods of [2] and [1], respectively. In a test set of 600 images of diverse types, containing about 1000 faces, algorithm A found >90% of faces with two eyes visible; algorithm B found >80% running at five images/second on a 900 MHz PC. For method A, 62 detection failures were due to excessive in-plane head rotation, and 32 to out-of-plane rotation. We wish now to detect the faces missed due to in-plane rotation. With ~90% detection rate yet only about 6% of faces lost to rotation, the challenge of our task becomes clear. If we detect all of the faces lost to in-plane rotation, yet reduce detection of upright faces by a few percent, no net gain will result. Also, we cannot forfeit the near-real-time speed of method B. Detection algorithms have performance dependent on the in-plane rotation of the face. Figure 1 shows the relative detection performance of our algorithms as a function of rotation. Algorithm A displays invariance to roughly ±20° of in-plane rotation, while algorithm B’s detection already drops 5-10% at ±10°. The performance of algorithm B is more quickly and adversely affected by face rotation than that of A.
Moreover, it may not be known ahead of time which way is “up” in an image, with four orientations possible. In the problem of single-frame orientation, the goal is to determine upright image orientation. Faces provide the single best cue for orientation, but it is computationally expensive to perform face detection four times at 90° rotations. Therefore, we seek full 360° invariance to in-plane face rotation. The remainder of this paper is organized as follows. Section 2 surveys related work in the literature. Section 3 describes our algorithms for estimating in-plane rotation of faces. Section 4 gives experimental results, and Section 5 summarizes and discusses future work.
2
Related Work
Object rotations in plane result in rotated patterns in a digitized image, while out-of-plane rotations produce entirely different patterns. There have been two main strategies for achieving in-plane rotation tolerance for face detection: (1) examining the image at multiple rotation settings; and (2) estimating in-plane face rotation window-by-window, and applying upright face detection to a de-rotated image window.
Fig. 1. Relative face detection rates (%) of algorithms A (solid) and B (dashed) as a function of in-plane face rotation (degrees).
Many researchers handle cases of in-plane rotation by examining the image at multiple rotations (e.g., [6]). Typically, a few orientations symmetrically placed about the nominal are used (e.g., +30°, 0°, -30°), implicitly assuming the correct 90° rotation of the image is known. The additional orientations find heads tilted from upright. Speed will decrease, and false positive rate increase, by a factor equal to the number of orientations. Some researchers preface their face detectors with an estimator of face in-plane rotation. The estimator decides the most likely orientation of a face, should a face happen to be present, returning a meaningless result otherwise. Thus, in [3], an orientation module examines windows. The module binarizes the test window using an
adaptive threshold, and matches to a face model consisting of dark eye lines in one-degree increments, using the (computationally expensive) Hausdorff distance. The subsequent classifier examines an explicitly de-rotated test window. Similarly, [4] inserts a neural network rotation-estimation step prior to the main classifier. In both these methods the time cost of the rotation estimation likely approaches that of the classifier, leading to a substantial speed loss in the overall detection system. A different means of achieving in-plane rotation invariance belongs to methods that perform “detection-by-parts” [e.g., 5], which attempt to locate telltale substructures of the object. These substructures are grouped geometrically into full-object hypotheses and subjected to further testing before final classification. If substructure detection exhibits a tolerance to in-plane rotation, the tolerance propagates to full-face detection. Methods of this type have not yet shown detection performance that compares competitively with whole-face methods, and are typically much slower.
3
In-Plane Rotation Estimation Using Integral Projections
A very fast rotation pre-estimator has been developed, based on learning the patterns of rectangular integrals in the face region of upright facial images. For many applications of face detection, algorithm speed carries great importance. Recently, near-real-time face detectors have been created [1,6,7]. Of the two principal approaches to rotation, the pre-estimator offers the fastest possible speed, if a fast calculation can provide the required estimation accuracy. Recent research [1] has used the concept of the integral image (II) [see Appendix] to devise fast computational methods. We use the II to extract convolution sums with rectangular kernels for in-plane orientation estimation. We compute N=58 sums to form a feature vector x for an image window.
3.1
Face Region Projections and Sums
The arrangement of facial features leads to typical intensity profiles across the vertical and horizontal directions of the facial region. In Figure 2, consider the sums of the horizontal and vertical lines of pixel intensity values in the circular central face region. It seems reasonable that the sums would show consistent differences, due to the relatively uniform placement of eyes, mouth, and cheeks. In the top row of Figure 3 are 10° rotated versions of the central face region, followed by two rows with the horizontal and vertical sums of the intensity pixels, normalized by pixel count. We make the following observations. First, in the upright face position, there is a clear difference in the horizontal and vertical sums. The vertical sum (third row) exhibits a symmetry that reflects the left-right physical symmetry of the face. The horizontal sum shows a strong minimum in the position of the eyes and a less pronounced minimum at the mouth position. Second, as the face rotates, these distinct signatures become mixed together until, at the 45° orientation, they are virtually identical. Third, at 90° the sums would be the same as for the upright data but with horizontal and vertical sums exchanged. There is a pseudo 90°- and 180°-periodicity in the sums. The central idea of orientation estimation will involve training a machine-learning algorithm to recognize the typical pattern of the
sums in the upright face orientation (left column of Figure 3) and to distinguish that pattern from those of the other orientations.
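A minimal sketch of these normalized projection sums is given below, assuming a square grayscale window; the circular mask construction here is an illustrative assumption rather than the paper's exact masking.

```python
import numpy as np

def projection_sums(window):
    """Row and column intensity sums of a square face window, normalized
    by the number of pixels contributing to each sum. A circular mask
    restricts the sums to the central face region."""
    h, w = window.shape
    yy, xx = np.mgrid[0:h, 0:w]
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    mask = (yy - cy) ** 2 + (xx - cx) ** 2 <= (min(h, w) / 2.0) ** 2

    masked = np.where(mask, window, 0.0)
    counts_h = mask.sum(axis=1)                 # pixels contributing per row
    counts_v = mask.sum(axis=0)                 # pixels contributing per column
    horiz = masked.sum(axis=1) / np.maximum(counts_h, 1)
    vert = masked.sum(axis=0) / np.maximum(counts_v, 1)
    return horiz, vert                          # two 1-D intensity profiles
```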
Fig. 2. The “average face”, circularly masked central region, and six rectangular regions.
In the bottom row of Figure 3, six additional convolution sums were computed, over the six rectangular facial regions in Figure 2. These regions were chosen due to the predominant failure mode that appeared when training the learning algorithm using only the horizontal and vertical integrals, namely erroneous estimation of the face orientation by exactly 180° (or, to a lesser extent, 90°). The six regions, covering the eyes and mouth, yield very different typical sums when the face is rotated 90° or 180°.
3.2
Probability Models
We estimate face orientation using the extracted data in two steps: first we apply a linear transformation to reduce the dimensionality of the data, then we perform a maximum a posteriori (MAP) estimation of the most likely orientation using a probability density model. We describe both a Gaussian density model and a Gaussian mixture model as suitable models. We will estimate the likelihood of the observed data given the premise that an upright face is present, and do so at k=36 evenly spaced rotations, every 10°. It is not necessary to rotate the image to those k orientations, but, rather, to compute what the II would be of those rotated images. The integral image calculation is very fast, being linear in the number of image pixels. Note that the II computations exhibit a kind of pseudo-periodicity with respect to 90° rotations that permits only k/4 IIs to actually be computed (see Appendix). Dimensionality Reduction. It is desirable to reduce the redundancies in visual information. The principal components analysis (PCA) creates a linear transformation into a related representational space in which the fewer dimensions are statistically independent, according to $\mathbf{y} = \Phi^{T}(\mathbf{x} - \bar{\mathbf{x}})$, where x and y represent test data vectors in the original and low-dimensional spaces, $\Phi$ the column matrix of most significant eigenvectors of the data covariance matrix, and $\bar{\mathbf{x}}$ the data mean. The transformation has the advantage of high speed. The profiles in Figure 3 show the presence of redundancy by the slowly varying nature of the middle two rows. A PCA can be easily computed (e.g., [8]), to enable the estimation of more reliable statistical models. In
this work, we used d=20 or d=40 for the dimension of the PCA subspace into which the data are transformed, from the original representation of the sums with N=58 dimensions.
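A minimal sketch of this dimensionality reduction is shown below; the variable names are assumptions, and in practice the eigenvectors would be estimated once from the training set of projection/sum feature vectors.

```python
import numpy as np

def fit_pca(X, d):
    """X: (n_samples, N) matrix of projection/sum feature vectors.
    Returns the data mean and the d most significant eigenvectors (columns)."""
    mean = X.mean(axis=0)
    Xc = X - mean
    # SVD of the centered data yields the eigenvectors of the covariance matrix
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    phi = vt[:d].T                      # (N, d) column matrix of eigenvectors
    return mean, phi

def project(x, mean, phi):
    """Low-dimensional representation y = phi^T (x - mean)."""
    return phi.T @ (x - mean)
```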
Fig. 3. Rotated faces (top row); horizontal- (second row), vertical- (third row), and regionalfourth row) face pixel sums.
Gaussian Probability Model. A Gaussian probability density model was used to estimate the likelihood of a data observation, given the assumption that an upright face was present. In this model, we start with the standard Gaussian form

$P(\mathbf{x}\mid\Omega) = \frac{\exp\left[-\frac{1}{2}(\mathbf{x}-\bar{\mathbf{x}})^{T}\Sigma^{-1}(\mathbf{x}-\bar{\mathbf{x}})\right]}{(2\pi)^{N/2}\,|\Sigma|^{1/2}}, \qquad (1)$

where $\Omega$ represents the class of faces, $\Sigma$ the data covariance matrix, and $\bar{\mathbf{x}}$ the mean of the Gaussian. We substitute an approximation for Equation (1) in which the principal components y are combined with the reconstruction residual to arrive at a two-component estimate of likelihood [8]

$\hat{P}(\mathbf{x}\mid\Omega) = \left[\frac{\exp\left(-\frac{1}{2}\sum_{i=1}^{d} y_i^{2}/\lambda_i\right)}{(2\pi)^{d/2}\prod_{i=1}^{d}\lambda_i^{1/2}}\right]\left[\frac{\exp\left(-\varepsilon^{2}(\mathbf{x})/2\rho\right)}{(2\pi\rho)^{(N-d)/2}}\right], \qquad (2)$

where $\rho$ is an estimate of the average of the N−d least significant eigenvalues, $\Phi$ the matrix of most significant eigenvectors of $\Sigma$ in its columns, the $\lambda_i$ are the d most significant eigenvalues, and $\varepsilon^{2}(\mathbf{x})$ the reconstruction residual of the test datum x. The
estimate of $\rho$ is conventionally made by fitting the eigenspectrum of $\Sigma$ to a nonlinear function. By considering the reconstruction residual, this formulation explicitly takes into account the possibility of an explanation for the observed data other than the presence of the desired object. It has been shown [8] that Equation (2) leads to improved density estimations in applications to facial imaging. Gaussian Mixture Model. As face poses depart increasingly from frontal, the Gaussian density model becomes less accurate. We therefore consider a richer model, the mixture model having M components. Such a model can approximate any continuous density to arbitrary accuracy for many choices of the component density functions [9]. The mixture model has the form

$P(\mathbf{y}\mid\Omega) = \sum_{j=1}^{M} P(\mathbf{y}\mid j)\,P(j), \qquad (3)$

subject to the constraints of probability

$\sum_{j=1}^{M} P(j) = 1, \qquad \int P(\mathbf{y}\mid j)\,d\mathbf{y} = 1.$

Each of the M components is a diagonal-covariance Gaussian model after

$P(\mathbf{y}\mid j) = \prod_{i=1}^{d} \frac{1}{\sqrt{2\pi}\,\sigma_i^{(j)}} \exp\left(-\frac{\left(y_i-\mu_i^{(j)}\right)^{2}}{2\left(\sigma_i^{(j)}\right)^{2}}\right),$

with y again being the low-dimensional representation of datum x. (Subscripts indicate vector components; superscripts match variances to mixture components.) We drop the residual term in Equation (2) to simplify parameter estimation. Training of the mixture model for density estimation requires iterated estimates of the parameters of the Gaussian components and the prior component probabilities P(j). The expectation-maximization (EM) method [9] uses the analytical derivatives of the data likelihood with respect to the model parameters ($P(j)$, $\mu^{(j)}$, $\sigma^{(j)}$, $j = 1..M$) to iteratively improve estimates of the parameters in a gradient-descent framework.
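The following is a minimal sketch of evaluating such a diagonal-covariance mixture for a low-dimensional datum; the parameter layout is an assumption, and the EM training itself (for which an off-the-shelf mixture-model implementation could be used) is not reproduced here.

```python
import numpy as np

def gmm_log_likelihood(y, weights, means, variances):
    """Log-likelihood of a low-dimensional datum y under an M-component
    diagonal-covariance Gaussian mixture.
    weights: (M,) priors P(j); means: (M, d); variances: (M, d)."""
    diff = y[None, :] - means                                   # (M, d)
    log_comp = (-0.5 * np.sum(diff ** 2 / variances, axis=1)
                - 0.5 * np.sum(np.log(2.0 * np.pi * variances), axis=1))
    # log-sum-exp over mixture components, weighted by the priors P(j)
    a = np.log(weights) + log_comp
    m = a.max()
    return m + np.log(np.exp(a - m).sum())
```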
3.3
Estimating Orientation
We summarize the process of estimating face orientation. A rotation-estimation module operates as a step in a global image face search. The search process looks across locations and scales in an image to find faces of different sizes and locations. Zeroing in on a single position and scale, we consider how to estimate face orientation, should a face be present. Integral images are computed for the image at k/4 rotations spaced over 90°, leading to 10° spacing, using bi-cubic interpolation. To examine a test window centered at (r,c) with scale s to determine whether a face be
present, the first task will be to estimate its rotational orientation. Using the IIs, we apply the probability model k/4 times by extracting the face region projections and sums at the proper location, taking into account the scale. These data undergo dimensionality reduction and are plugged into the probability model [Equations (2) or (3)] to obtain the likelihood conditioned on face presence. The procedure is repeated three times with orthogonal rotations of 90°, 180°, and 270°, advantage being taken of the periodicity properties of II, according to Table 2. In this way, k likelihood values for face presence are estimated. True rotational angle is judged to be the angle that gives rise to the highest likelihood. Figure 4 shows a typical pattern of likelihood at different orientations for a face whose true orientation was 230°. The likelihood peaks at 130°, the amount of de-rotation needed to make the face upright. The response profile also shows a secondary peak, 180° away from the primary maximum. This common occurrence led to the introduction of the additional six rectangle measurements (shown in Figure 2) in the data vector. Figure 4 also shows example failures of rotation estimation, with some common causes - occlusion, hats, low contrast, and beards.
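To make the search loop concrete, a minimal sketch of the MAP orientation estimate is given below; `extract_features` and `log_likelihood` are placeholder callables (assumptions) standing in for the II-based feature extraction and the probability models of Section 3.2.

```python
import numpy as np

def estimate_orientation(extract_features, log_likelihood, k=36):
    """Evaluate the upright-face likelihood at k evenly spaced de-rotation
    angles and return the angle that gives the highest likelihood.
    extract_features(angle) -> feature vector for the window de-rotated by
    'angle' degrees (in the paper it is read from the k/4 integral images);
    log_likelihood(y) -> log P(y | upright face)."""
    angles = np.arange(k) * (360.0 / k)          # 0, 10, ..., 350 degrees
    scores = [log_likelihood(extract_features(a)) for a in angles]
    return angles[int(np.argmax(scores))]
```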
4
Results
A set of 4939 facial images, with resolution 56 × 56 pixels and eyes manually aligned, was used for training and testing. The images came from a variety of different sources. In general, algorithms were tested using a five-fold cross-validation strategy: four-fifths of the image set was used to train, while the other one-fifth was used to test; and the process was repeated five times. Every image served as a test object at least once, and no training image doubled as a test image.
Fig. 4. (Left) Computed log-likelihood of upright face presence from Gaussian model, by derotation angle. (Right) Example faces causing orientation estimation errors. Arrows indicate estimated upward direction.
A Gaussian model-based rotation estimator was trained and tested with the fivefold cross-validation strategy, with results shown in Table 1. The test images all had true upright orientation. In a second test, one of the five sets was used to probe the
algorithm with rotated test images. The test images were each rotated to the 36 angular positions, spaced every 10°, and their rotational orientation was estimated. Results, given in Figure 5, show little effect of the true face orientation on estimation accuracy. A Gaussian mixture model with three components was trained to estimate rotation. Groups of 1218 faces each were used to initialize the mixture components, containing left-, frontal, and right-facing heads, with the hope that the components would each specialize in modeling one of the head-pose ranges. Following training, the component probabilities P(j) in Equation (3) were 0.24, 0.50, and 0.26. Table 1 shows the results of the five-fold cross validation experiment. The rotational sensitivity test was also repeated for the Gaussian mixture model (Figure 5). Both models performed consistently across all 36 angular positions.
5
Discussion and Future Work
We present two algorithms to estimate the in-plane rotation of faces. The design goals were a combination of very high estimation accuracy and speed. Both methods perform at levels of 95% within ±10° of the true orientation and are fast enough for near real-time face detection systems. They use Gaussian and Gaussian mixture probability models. The mixture model was adopted to try to manage the complexities of head rotation out-of-plane. Measured on the test set of images used here, the performance of the two models is almost indistinguishable. The Gaussian model performs slightly better but with larger variance; it is also computationally less expensive, slowing our fast detector only slightly from five to four images per second. Opportunities for improvements in these algorithms include better feature extraction and better probability density modeling. More or different features can be used to solve the typical failure modes shown in Figure 4. It is desirable to continue to base features on the II to maintain its speed advantage. AdaBoost feature selection schemes [10] could be applied here. A very recent SVM probability-density estimation method, with reduced set support [11], shows advantage compared to our simple
Gaussian mixture model. Of concern, however, would be the greater computational cost to evaluate the point-probability densities.
Fig. 5. Accuracies of Gaussian (solid) and Gaussian mixture model (dashed) as a function of actual face rotation.
References 1. Viola, P., and Jones, M., “Robust Real-Time Object Recognition,” Proc. Second Int. Workshop on Statistical and Computational Theories of Vision – Modeling, Learning, Computing, and Sampling 2001. 2. Schneiderman, H., “A Statistical Approach to 3D Object Detection Applied to Faces and Cars,” Proc. IEEE Conf. Computer Vision and Pattern Recognition 2000. 3. Jeon, B., Lee, S., and Lee, K., “Face Detection using the RCE Classifier,” Proc. IEEE Int. Conf. Image Processing, II-125-II-128, 2002. 4. Rowley, H., Baluja, S., and Kanade, T., “Rotation Invariant Neural Network-Based Face Detection,” Proc. IEEE Conf. Computer Vision and Pattern Recognition, 38–44, 1998. 5. Heisele, B., Ho, P., Wu, J., and Poggio, T., “Face recognition: component-based versus global approaches,” Computer Vision and Image Understanding, 91, 6–21, 2003. 6. Li, S., Zhu, L., Zhang, Z., Blake, A., Zhang, H., and Shum, H., “Statistical Learning of Multi-view Face Detection,” Proc. European Conf. Computer Vision, 67-81, 2002. 7. Chen, S., Nicponski, H., and Ray, L., “Distributed Face Detection System with Complementary Classifiers,” Proc. Assoc. Intelligent Machinery Joint Conf. Information Sciences, 735–738, 2003. 8. Moghaddam, B., and Pentland, A., “Probabilistic Visual Learning for Object Recognition,” IEEE Trans. Pattern Analysis and Machine Intelligence, 19(7), 696–710, 1997. 9. Bishop, C., Neural Networks for Pattern Recognition, Oxford University Press: Oxford, 1995. 10. Freund, Y., and Schapire, R., “A Decision-Theoretic Generalization of On-line Learning and an Application to Boosting,” J. Computer and System Sciences, 55(1), 119–139, 1997.
11. Girolami, M., and He, C., “Probability Density Estimation from Optimally Condensed Data Samples,” IEEE Trans. Pattern Analysis and Machine Intelligence, 25(10), 1253– 1264, 2003. 12. Simard, P., Bottou, L., Haffman, P., and LeCun, Y., “Boxlets: a Fast Convolution Algorithm for Signal Processing and Neural Networks,” Advances in Neural Information Processing Systems, 11, 571–577, 1999. 13. Yang, M., Kriegman, D., and Ahuja, N., “Detecting Faces in Images: A Survey,” IEEE Trans. Pattern Analysis and Machine Intelligence, 24(1), 34–58, 2002.
Appendix: Integral Image The integral image (II) is computed in time linear in the count of image pixels, and enables convolution sums over rectangular kernels to be computed in constant time. The II of intensity image I(i,j) is defined in the discrete domain by

$II(i,j) = \sum_{i' \le i}\;\sum_{j' \le j} I(i',j').$

Thus, the II value at entry (i,j) is the summation of all image pixels above and to the left of (i,j), inclusive. Any arbitrary rectangular convolution sum can be computed with four memory accesses to the II and three additions [1, 12]. The II has a pseudo-periodicity with respect to orthogonal rotation. From an II for the nominal orientation, the IIs of the other three orientations can be derived. Values of the II are sums over rectangles with one corner fixed at the image origin. Since rectangles remain rectangles upon orthogonal rotation, IIs of such rotated images are redundant. Equivalences are shown in Table 2, in which a rectangle of size w × h, anchored at a given corner and rotated about the center of rotation, is transformed with new parameters expressed in the coordinate system of the nominal II.
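A minimal sketch of the integral image and the constant-time rectangle sum described above, assuming a NumPy image array:

```python
import numpy as np

def integral_image(img):
    """II(i, j) = sum of I over all pixels above and to the left of (i, j),
    inclusive. Computed in time linear in the number of pixels."""
    return img.astype(np.float64).cumsum(axis=0).cumsum(axis=1)

def rect_sum(ii, top, left, height, width):
    """Sum of the rectangle with top-left corner (top, left): four memory
    accesses and three additions/subtractions, independent of rectangle size."""
    bottom, right = top + height - 1, left + width - 1
    total = ii[bottom, right]
    if top > 0:
        total -= ii[top - 1, right]
    if left > 0:
        total -= ii[bottom, left - 1]
    if top > 0 and left > 0:
        total += ii[top - 1, left - 1]
    return total
```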
Feature Fusion Based Face Recognition Using EFM Dake Zhou and Xin Yang Institute of Image Processing & Pattern Recognition, Shanghai Jiaotong University, Shanghai 200030, China {normanzhou, yangxin}@sjtu.edu.cn
Abstract. This paper presents a fusing feature Fisher classifier approach for face recognition, which is robust to moderate changes of illumination, pose and facial expression. In the framework, a face image is first divided into smaller sub-images and then the discrete cosine transform (DCT) technique is applied to the whole face image and some sub-images to extract facial holistic and local features. After concatenating these DCT based facial holistic and local features to a facial fusing feature vector, the enhanced Fisher linear discriminant model (EFM) is employed to obtain a low-dimensional facial feature vector with enhanced discrimination power. Experiments on ORL and Yale face databases show that the proposed approach is superior to traditional methods, such as Eigenfaces and Fisherfaces .
1
Introduction
Face recognition (FR) techniques could be generally categorized into two main classes [1]: 1) feature-based methods, which rely on the detection and characterization of individual facial features (i.e., eyes, nose, and mouth etc.) and their geometrical relationships; 2) holistic-based methods, which are the template matching approaches based on the whole facial information. Motivated by the needs of surveillance and security, telecommunication and human-computer intelligent interaction, FR techniques have developed greatly over the past two decades, but some problems remain [2]. A significant one is that most FR approaches perform poorly or even cannot work under various conditions, such as changing illumination, pose, and facial expression. An approach to this problem may be to use facial holistic as well as local information for face recognition, which is inspired by the fact that both holistic and local information are necessary for human recognition of faces [2,3]. In Ref. [4,5], eigenfaces plus eigenfeatures (eigeneyes and eigennose) is used to identify faces, which leads to an expected improvement in recognition performance. This approach, however, has two limitations: 1) it does not use class information, as it is only based
The work was partially supported by National Natural Science Foundation of China (No.30170264), and National Grand Fundamental Research 973 Program of China (No.2003CB716104). A. Campilho, M. Kamel (Eds.): ICIAR 2004, LNCS 3212, pp. 643–650, 2004. © Springer-Verlag Berlin Heidelberg 2004
on the principal component analysis (PCA) technique; 2) it needs accurate facial feature (eyes and nose) detection, which is very difficult in practice. The main objective of this research is to improve the accuracy of face recognition subjected to varying facial expression, illumination and pose. In this paper, a fusing feature Fisher classifier approach is proposed for face recognition, which is robust to moderate changes of illumination, pose and facial expression. In the framework, a face image is first divided into smaller sub-images and then the discrete cosine transform (DCT) technique is applied to the whole face image and some sub-images to extract facial holistic and local features. After concatenating these DCT based facial holistic and local features to a facial fusing feature vector, the enhanced Fisher linear discriminant model (EFM) is employed to obtain a low-dimensional facial feature vector with enhanced discrimination power. Finally, the nearest neighbor (to the mean) rule with Euclidean distance measure is used for classification. Experimental results on ORL and Yale face databases show that the proposed approach is more robust than traditional FR approaches, such as Eigenfaces and Fisherfaces.
2
DCT Based Face Representation
Among various deterministic discrete transforms, the DCT best approximates the Karhunen-Loeve transform (KLT), which is widely used for feature extraction in the FR community. Additionally, the DCT can be computed more efficiently than the KLT because it can be implemented using the fast Fourier transform algorithm [6]. Therefore, we employ the DCT for face representation, i.e., a low-to-mid frequency subset of the 2-dimensional (2-D) DCT coefficients of a face image is extracted as the facial global feature, which is similar to that used in Ref. [7]. In this paper, a square subset is used for the feature vector. The size of this subset is chosen such that it can sufficiently represent a face, but it can in fact be quite small, as will be shown in our experiments.
Fig. 1. A face image (Left) and its local regions of eyes and nose (Right).
A similar technique is used to extract facial local information. Considering facial structure and the size of the face image, we first divide the whole face image roughly into several small overlapping sub-images, such as the forehead, eyes and nose sub-images etc. Obviously, the regions of eyes, nose and mouth are the most salient regions for face recognition [1]. However, since the mouth shape is very sensitive to changes of facial expression, the mouth region is discarded and only the eyes and nose regions are used in this paper. DCT is then applied to the two sub-images
to extract local information. Fig. 1 shows a face image and its local regions of eyes and nose. Let $\mathbf{x}_h$ denote the facial holistic feature vector, and $\mathbf{x}_e$, $\mathbf{x}_n$ the eyes- and nose-region feature vectors, respectively. Thus, they can be defined as follows:

$\mathbf{x}_h = \mathrm{Reshape}(\mathcal{C}(f), n_h), \quad \mathbf{x}_e = \mathrm{Reshape}(\mathcal{C}(f_e), n_e), \quad \mathbf{x}_n = \mathrm{Reshape}(\mathcal{C}(f_n), n_n),$

where $\mathcal{C}(\cdot)$ denotes the 2-D DCT, f, $f_e$ and $f_n$ denote the face image, eyes and nose sub-images, respectively, and Reshape(A, n) is a function that extracts the top-left n × n square matrix from matrix A and then transforms this square matrix into a column vector. A new feature vector $\mathbf{x}$ is then defined as the concatenation of $\mathbf{x}_h$, $\mathbf{x}_e$ and $\mathbf{x}_n$. Therefore, the corresponding facial fusing feature vector Y can be derived from $\mathbf{x}$ by:

$Y_j = \frac{x_j - m_j}{\sigma_j}, \quad j = 1, \ldots, k,$

where $\mathbf{m} = \frac{1}{n}\sum_{i=1}^{n}\mathbf{x}_i$ is the mean vector of the training vectors, n is the number of training samples, $\sigma_j$ is the j-th component of the standard deviation of the training vectors, and k is the dimensionality of vector Y.
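A minimal sketch of the fusing-feature computation is given below; it assumes pre-cropped eye and nose sub-images, uses SciPy's DCT, and the default subset sizes (7×7 holistic, 4×4 per region) follow the 49 and 16 coefficients chosen later in the experiments.

```python
import numpy as np
from scipy.fft import dctn

def dct_feature(region, n):
    """Reshape(DCT2(region), n): 2-D DCT of the region, keep the top-left
    n x n low-to-mid frequency block and stack it into a vector."""
    coeffs = dctn(region, norm='ortho')
    return coeffs[:n, :n].reshape(-1)

def fusing_feature(face, eyes, nose, train_mean, train_std,
                   n_holistic=7, n_local=4):
    """Concatenated holistic + local DCT features, normalized component-wise
    by the training-set mean and standard deviation."""
    x = np.concatenate([dct_feature(face, n_holistic),
                        dct_feature(eyes, n_local),
                        dct_feature(nose, n_local)])
    return (x - train_mean) / train_std
```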
3
Fusing Feature Fisher Classifier
In the process of the DCT based facial fusing feature extraction, however, the class information is not used. To improve its classification performance, one needs to further process this fusing feature with some discrimination criterion.
3.1 Fisher Linear Discriminant Analysis Fisher Linear Discriminant Analysis (FLD), which is also referred to as Linear Discriminant Analysis (LDA), is one of the most widely used discrimination criteria in face recognition [8,9]. The basic idea of the FLD is to seek a projection that maximizes the ratio of the between-class scatter and the within-class scatter. Let $S_w$ and $S_b$ denote the within- and between-class scatter matrices, respectively. The goal of FLD is to find a projection matrix W that maximizes the Fisher criterion function J(W) defined as:

$J(W) = \frac{\left|W^{T} S_b W\right|}{\left|W^{T} S_w W\right|}.$
The criterion function J(W) is maximized when W consists of the eigenvectors of the matrix $S_w^{-1} S_b$. One main drawback of FLD is that it requires a large training sample size for good generalization. When such a requirement is not met, FLD overfits the training data and thus generalizes poorly to new testing data. For the face recognition problem, however, there are usually a large number of faces (classes), but only a few training samples per face. One possible remedy for this drawback is to artificially generate additional data and thereby increase the sample size [8]. Another remedy is to balance the need for adequate signal representation and subsequent classification performance by using sensitivity analysis on the spectral range of the within-class eigenvalues, which is also referred to as the enhanced Fisher linear discriminant model (EFM) [10].
3.2 Enhanced Fisher Linear Discriminant Model The enhanced Fisher linear discriminant model (EFM) improves the generalization capability of FLD by decomposing the FLD procedure into a simultaneous diagonalization of the within- and between-class scatter matrices. The simultaneous diagonalization is stepwise equivalent to two operations: whitening the within-class scatter matrix and applying PCA on the between-class scatter matrix using the transformed data [10]. The EFM first whitens the within-class scatter matrix:

$S_w \Xi = \Xi \Gamma, \qquad \Xi^{T}\Xi = I,$

where $\Xi$ is the eigenvector matrix of $S_w$, I is the unitary matrix and $\Gamma$ is the diagonal eigenvalue matrix of $S_w$ with diagonal elements in decreasing order. During the whitening step, the small eigenvalues corresponding to the within-class scatter matrix are sensitive to noise, which causes the whitening step to fit misleading variations. So, the generalization performance of the EFM will degenerate rapidly when it is applied to new data. To achieve enhanced performance, the EFM keeps a good tradeoff between the need for adequate signal representation and generalization performance by selecting suitable principal components. The criterion for choosing eigenvalues is that the spectral energy requirement (which implies that the selected eigenvalues should account for most of the spectral energy) and the magnitude requirement (which implies that the selected eigenvalues should not be too small, i.e., better generalization) should be considered simultaneously. Suppose m eigenvalues are selected based on this criterion; m is then determined as the largest integer for which the selected eigenvalues still satisfy the magnitude requirement while accounting for a sufficient fraction of the spectral energy, the energy fraction and magnitude bounds being given by two preset thresholds.
We can then obtain the matrices $\Xi_m$ (the first m eigenvectors) and $\Gamma_m$ (the corresponding m × m diagonal eigenvalue matrix). The new between-class scatter matrix can be defined as follows:

$K_b = \Gamma_m^{-1/2}\,\Xi_m^{T}\, S_b\, \Xi_m\, \Gamma_m^{-1/2}.$

Now EFM diagonalizes the new between-class scatter matrix:

$K_b \Theta = \Theta \Delta, \qquad \Theta^{T}\Theta = I,$

where $\Theta$ and $\Delta$ are the eigenvector and the diagonal eigenvalue matrices of $K_b$, respectively. Finally, the overall transformation matrix of the EFM can be defined as follows:

$T = \Xi_m\, \Gamma_m^{-1/2}\, \Theta.$
3.3 Classification Rule and Similarity Measures When a face image is presented to the classifier, the DCT-based fusing feature vector Y of the image is first calculated and then the EFM is applied to obtain a lower-dimensional feature vector Z with enhanced discrimination power: $Z = T^{T} Y$. Finally, the nearest neighbor (to the mean) rule and Euclidean similarity (distance) measure are used for classification: the image is assigned to the class whose mean feature vector $\bar{Z}_c$ minimizes $\lVert Z - \bar{Z}_c \rVert$.
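A minimal sketch of the EFM training and the nearest-mean classification is given below; the scatter matrices are assumed to be precomputed, and the eigenvalue-selection thresholds are illustrative assumptions standing in for the preset thresholds mentioned above.

```python
import numpy as np

def train_efm(Sw, Sb, energy_thresh=0.98, magnitude_thresh=1e-6):
    """Simultaneous diagonalization of the within- and between-class scatter:
    whiten Sw, keep m eigenvalues by an energy/magnitude criterion, then
    apply PCA to the transformed between-class scatter."""
    evals, evecs = np.linalg.eigh(Sw)
    order = np.argsort(evals)[::-1]             # decreasing eigenvalue order
    evals, evecs = evals[order], evecs[:, order]

    energy = np.cumsum(evals) / np.sum(evals)
    keep = (energy <= energy_thresh) & (evals > magnitude_thresh)
    m = max(int(keep.sum()), 1)

    whiten = evecs[:, :m] / np.sqrt(evals[:m])  # (N, m) whitening transform
    Kb = whiten.T @ Sb @ whiten                 # new between-class scatter
    _, theta = np.linalg.eigh(Kb)
    theta = theta[:, ::-1]                      # most significant first
    return whiten @ theta                       # overall EFM transform T

def classify(z, class_means):
    """Nearest (to the mean) neighbour under Euclidean distance."""
    dists = np.linalg.norm(class_means - z, axis=1)
    return int(np.argmin(dists))
```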
4
Experiments
We use the publicly available ORL and Yale face databases to test the proposed approach, with the consideration that the first database is used as a baseline study while the second one is used to evaluate face recognition methods under varying lighting conditions. The ORL database contains 400 face images of 40 distinct subjects. Ten images are taken for each subject, and there are variations in facial expression, details and pose. The images have 256 grayscale levels with a resolution of 112×92. The Yale database consists of 165 face images of 15 subjects. There are variations in facial expression, details and illumination. The original images have 256 grayscale levels with a resolution of 160×121. Note that the images in the Yale database, before being used in our experiment, are normalized to a size of 100×100 by using the geometrical normalization technique suggested by Brunelli et al. [11] and the histogram equalization technique. We first investigate how many DCT coefficients should be used for the facial holistic representation by performing the DCT-based holistic method on the ORL database with the “leave-one-out” strategy. That is, for each of the 400 images in the database, the closest match from the remaining 399 was found. This was repeated for various numbers of DCT coefficients, and the result is shown in Tab. 1. One can see that the recognition accuracy is very high while the number of the DCT coefficients
used is in the range [25, 81]. If more DCT coefficients are used, there is a slight decrease in recognition accuracy because these additional coefficients, corresponding to the details of images, are more sensitive to noise. Therefore, we use 49 coefficients to represent the facial holistic information, considering both recognition accuracy and computational cost. Similarly, we will use 16 coefficients to represent each local region in the following experiments.
Fig. 2. Classification rate vs. Feature dimensionality (ORL database).
The next series of experiments on the ORL database studied the comparative performance of several methods, including the proposed method, the EFM method, and the well-known Eigenfaces and Fisherfaces. Fig. 2 shows the classification rate curves of the four methods with respect to the dimensionality of features when 5 face images per person are selected randomly for training. The proposed method outperforms the other three methods. In particular, our method achieves 95.6% recognition accuracy when only 29 features are used. The classification rate curves of the four methods are also shown in Fig. 3 as functions of the number of training samples per person. One can see from this figure that our proposed method also performs the best among the four methods.
Fig. 3. Classification rate vs. Number of training samples per person (ORL database).
To evaluate the recognition performance under varying lighting conditions, we performed the last series of experiments on the Yale database with 5 training samples per person. The proposed method also outperforms the other three methods, as shown in Tab. 2.
5
Conclusions
This paper introduces a fusing feature Fisher classifier for face recognition, which is robust to moderate changes of illumination, pose and facial expression. The key to this method is to apply the enhanced Fisher linear discriminant model (EFM) to a fusing feature vector derived from the DCT-based facial holistic and local representations. The facial fusing feature, encompassing a coarse (low-resolution)
facial global description augmented by additional (high-resolution) local details, can represent faces more robustly. The EFM, developed from the traditional Fisher classifier with improved generalization capability, further improves the robustness of the approach. The feasibility of the proposed approach has been successfully tested on the ORL and Yale face databases, which were acquired under varying pose, illumination and expression. Comparative experiments on these two face databases also show that the proposed approach is superior to traditional methods, such as Eigenfaces and Fisherfaces. Another advantage of our approach is its lower computational complexity during the training stage. Our approach uses the DCT, rather than the KLT, to extract the facial fusing feature, which means that it can be computed more efficiently than entirely statistics-based methods, especially when running on a large face database.
References
1. Chellappa, R., Wilson, C.L., Sirohey, S.: Human and machine recognition of faces: a survey. Proc. IEEE, Vol. 83 (1995) 705–740
2. Zhao, W., Chellappa, R., Rosenfeld, A., Phillips, P.J.: Face recognition: a literature survey. (2000) http://citeseer.nj.nec.com/374297.html
3. Sukthankar, G.: Face recognition: a critical look at biologically-inspired approaches. Technical Report CMU-RI-TR-00-04, Carnegie Mellon University, Pittsburgh, PA (2000)
4. Moghaddam, B., Pentland, A.: Probabilistic visual learning for object representation. IEEE Trans. Patt. Anal. Mach. Intell., Vol. 19 (1997) 696–710
5. Pentland, A., Moghaddam, B., Starner, T.: View-based and modular eigenspaces for face recognition. Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition, Seattle, WA (1994) 84–91
6. Rao, K., Yip, P.: Discrete Cosine Transform - Algorithms, Advantages, Applications. Academic Press, New York, NY (1990)
7. Hafed, Z.M., Levine, M.D.: Face Recognition Using the Discrete Cosine Transform. International Journal of Computer Vision, Vol. 43 (2001) 167–188
8. Etemad, K., Chellappa, R.: Discriminant analysis for recognition of human face images. Journal of the Optical Society of America A: Optics, Image Science and Vision, Vol. 14 (1997) 1724–1733
9. Jain, A.K., Duin, R., Mao, J.: Statistical pattern recognition: a review. IEEE Trans. Patt. Anal. Mach. Intell., Vol. 22 (2000) 4–37
10. Liu, C., Wechsler, H.: Gabor feature based classification using the enhanced Fisher linear discriminant model for face recognition. IEEE Trans. Image Processing, Vol. 11 (2002) 467–476
11. Brunelli, R., Poggio, T.: Face recognition: features versus templates. IEEE Trans. Patt. Anal. Mach. Intell., Vol. 15 (1993) 1042–1052
Real-Time Facial Feature Extraction by Cascaded Parameter Prediction and Image Optimization
Fei Zuo 1 and Peter H.N. de With 2
1 Eindhoven University of Technology, Faculty E-Eng., 5600MB Eindhoven, NL
2 LogicaCMG/Eindhoven Univ. of Technol., P.O.Box 7089, 5605JB Eindhoven, NL
[email protected]
Abstract. We propose a new fast facial-feature extraction technique for embedded face-recognition applications. A deformable feature model is adopted, of which the parameters are optimized to match with an input face image in two steps. First, we use a cascade of parameter predictors to directly estimate the pose (translation, scale and rotation) parameters of the facial feature. Each predictor is trained using Support Vector Regression, giving more robustness than a linear approach as used by AAM. Second, we use the generic Simplex algorithm to refine the fitting results in a constrained parameter space, in which both the pose and the shape deformation parameters are optimized. Experiments show that both the convergence and the accuracy improve significantly (doubled convergence area compared with AAM). Moreover, the algorithm is computationally efficient.
1
Introduction
Accurate facial feature extraction is an important step in face recognition. Our aim is to build a feature-extraction system that can be used for face recognition in embedded and/or consumer applications. This application field imposes additional requirements in addition to feature extraction accuracy, such as real-time performance under varying lighting conditions, etc. One promising technique for facial feature extraction is to use a deformable model [5], which can adapt itself to optimally fit to individual images while satisfying certain model constraints. The constraints can be derived from the prior knowledge about the object properties (e.g. shape and texture). The feature extraction process can then be seen as an optimization process, where the model parameters are adjusted to minimize a cost function for fitting. In earlier research [1], a parameterized deformable template is used for facial feature extraction. However, it is computationally expensive and the convergence is not guaranteed. Recently, the Active Shape Model (ASM) and Active Appearance Model (AAM) [2] have been proposed as two promising techniques for feature extraction. The ASM fits a shape model to a real image by using a local deformation process, constrained by a global variance model. However, the
ASM searches for the ‘best-fit’ for each landmark independently, which sometimes leads to unstable results. The global constraints can maintain a plausible shape, but they cannot ‘correct’ the wrong local adjustments. The AAM incorporates global texture modelling giving more matching robustness, but we have found that the linear model parameter prediction used by AAM only works well for very limited ranges. The deformable model fitting can also be solved by applying a general optimization algorithm, which gives accurate fitting results, provided that the cost function is appropriately defined and its global minimum is found. However, the crude use of such a technique either leads to erroneous local minima (when a local optimization algorithm such as the gradient-descent algorithm is applied) or takes too much computation cost (when a global stochastic optimization algorithm such as the genetic algorithm is applied). In this paper, we propose a novel model-based facial feature extraction technique, employing both fast parameter prediction and direct optimization for each individual image. The used feature model is a variant of the statistical model in [2]. The fitting of the model to a real image is performed in two steps. First, a cascade of parameter predictors are used to estimate in a single step the ‘correct’ pose parameters (translation, scale and rotation). Second, a general optimization algorithm is used to further improve the extraction accuracy. In our case, a Simplex algorithm [8] is adopted to jointly optimize the pose and the shape deformation parameters. The aim is to obtain fast and accurate feature extraction results, which may enable re-usage in the face-recognition stage.
2 Statistical Feature Model
2.1 Feature Model with Extended Shape and Texture Structure
Motivated by ASM and AAM, we build our statistical feature model by incorporating both shape and texture information. The geometrical shape of a facial
Fig. 1. The feature model.
feature (e.g. an eye) can be represented by a set of discrete feature points $FP = \{p_1, \ldots, p_N\}$, where N is the number of feature points. In contrast with ASM/AAM, where only corners and/or contour points are selected, we use an extended set of feature points covering a larger texture region. To this end, we introduce a set of auxiliary outer feature points (Fig. 1(a)), which
can be derived from the original feature points FP (inner feature points) by extending each to a neighboring point in the direction perpendicular to the contour curvature. The extension range is proportional to the size of the feature. Although the outer feature points depend on the inner points and provide no new shape information, they encapsulate a larger texture region and incorporate more information (both the inner texture and the surrounding texture). Based on the extended set of feature points, a Delaunay triangulation is performed to construct a mesh over the feature region (see Fig. 1(b)). The triangular mesh is used for warping the texture to a standard shape (see Section 2.2).
2.2 Generic PCA-Based Feature Model
To obtain a feature model that can adapt to individual shape variations, we adopt Principal Component Analysis (PCA) from ASM/AAM to model the shape variations. Suppose the matrix P contains the eigenvectors corresponding to the L largest eigenvalues after the PCA decomposition; then any normalized (w.r.t. position, scale and rotation) shape vector can be approximated by

$\hat{\mathbf{x}} = \bar{\mathbf{x}} + P\,\mathbf{b}, \qquad (1)$

where $\bar{\mathbf{x}}$ is the mean normalized shape, and vector b defines a set of deformation parameters for the given class of features. If the geometric transformation (translation, scale and rotation) is incorporated, any feature shape vector x (not normalized) can be modelled using the normalized mean shape and a parameter set including both pose parameters and deformation parameters b by

$\mathbf{x} = T_{t_x, t_y, s, \theta}\left(\bar{\mathbf{x}} + P\,\mathbf{b}\right). \qquad (2)$

In Equation (2), $T_{t_x, t_y, s, \theta}$ denotes the geometric transformation by translation $(t_x, t_y)$, scaling s and rotation $\theta$. Based on the shape information, the texture overlaid by the shape can be sampled and warped to a mean shape by piecewise affine warping (Fig. 1(b) and (c)). The texture samples are then scanned on a line basis and reordered into one vector t, which is normalized by mean and standard deviation. Given the feature model, the feature extraction in a new image can be formulated as a parameter estimation problem. The optimal shape parameters need to be located, so that the texture region covered by the estimated shape has the minimum matching error with a normalized template texture.
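A minimal sketch of generating a shape instance from the model parameters, following Equation (2); the (x, y) interleaving of the shape vector and the parameter names are assumptions.

```python
import numpy as np

def shape_instance(mean_shape, P, b, tx, ty, scale, theta):
    """mean_shape: (N, 2) normalized mean feature points; P: (2N, L) matrix
    of shape eigenvectors (x/y interleaved rows assumed); b: (L,) deformation
    parameters. Returns the feature points after deformation followed by a
    similarity transform (scale, rotation, translation)."""
    pts = (mean_shape.reshape(-1) + P @ b).reshape(-1, 2)
    c, s = np.cos(theta), np.sin(theta)
    R = scale * np.array([[c, -s], [s, c]])
    return pts @ R.T + np.array([tx, ty])
```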
3 Model Fitting by Prediction and Optimization
3.1 Overview of Model Fitting
We search for an optimal set of model parameters for a new image by taking the following two steps: pose parameter prediction and direct local optimization.
Motivated by AAM, we utilize the prior knowledge of the properties of the feature and its neighboring areas. A set of learning-based predictors are trained, which are able to directly predict the pose parameters given the incorrectly placed shape. Our prediction scheme has two distinct features. 1) We use Support Vector Regression (SVR) [3] to train the parameter predictors. Due to its nonlinearity, the SVR prediction is more reliable and robust than the limited linear prediction used by AAM. We have found in our experiments that the SVR is able to predict the model parameters correctly, even for very large pose deviations. 2) We use a cascade of SVR predictors to boost the prediction accuracy. We have found that the SVR predictors trained with varying pose variation ranges lead to different prediction errors. The cascading of these predictors can ‘pull’ the parameters to the correct position in a step-wise manner. Parameter prediction quickly finds the approximately correct pose parameters. At the second stage, we use a direct image optimization of both the pose and deformation parameters within a small constrained area, based on the SVR prediction statistics.
Fig. 2. Feature vector for SVR.
3.2
Cascaded Prediction by Support Vector Regression
The cascaded prediction involves the following aspects. Feature vector preparation. A reduction of the dimensionality of the texture vector decreases the training effort and the computational complexity. Therefore, we extract the vertical and horizontal profiles from the normalized texture region and use the combined profile vector v for texture representation. In our experiments with eye extraction, the dimensionality of the feature space is reduced from 1700 to 100, giving more reliable training results and faster processing. Parameter prediction using Support Vector Regression (SVR). Given an initial shape vector x and its associated profile vector v, a geometric transformation correction $\delta\mathbf{p}$ can be applied to x for an optimal fit to the image. We try to build a prediction function f for the geometric transformation that deforms the shape towards the actual feature, thus $\delta\mathbf{p} = f(\mathbf{v})$. We obtain the prediction function f by support vector regression. The SVR uses kernel functions to map data to a higher dimensional space and thus achieves
nonlinear mapping. For each training image, we randomly displace each vector element of p from the manually annotated known optimal value p to a perturbed value, and obtain the displaced shape and its corresponding profile vector v. We then use the resulting training set of (v, $\delta\mathbf{p}$) pairs to train the regression function f [4].
Fig. 3. SVR-based prediction vs. the actual parameter deviation. The total length of each vertical error bar corresponds with two standard deviations.
Experiments for the prediction. In our experiment, we used a face set composed of 37 labelled face images [7]. We randomly selected 27 images for training and the remaining 10 images were used for testing. For each training image, we randomly perturbed all the pose parameters and collected 100 data samples. The perturbation range is shown in Table 1. For the prediction function f we used the Radial Basis Function kernel, and all the SVR parameters were selected by cross-validation. We tested the learned prediction function on the test set. The experimental results are shown in Fig. 3. It can be seen that the SVR-based prediction gives good results even for very large parameter deviations. The prediction error is distributed uniformly for various parameter displacements. For comparison, the prediction accuracy of the linear prediction scheme as used by AAM deteriorates sharply when the parameter displacement exceeds a small value, e.g. a horizontal displacement of only ±20% of the shape width. Our system performs better because the SVM-based approaches are more flexible and can learn and adapt to the complexity of the problem.
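A minimal sketch of training one such displacement predictor with an RBF-kernel SVR (using scikit-learn rather than the LIBSVM interface of [4]); the hyperparameter values are placeholders, since the paper selects them by cross-validation.

```python
import numpy as np
from sklearn.svm import SVR

def train_displacement_predictor(profiles, displacements):
    """profiles: (n_samples, 100) combined horizontal/vertical profile
    vectors v extracted from randomly perturbed shapes;
    displacements: (n_samples,) ground-truth perturbation of one pose
    parameter (e.g. horizontal shift). One SVR is trained per parameter."""
    model = SVR(kernel='rbf', C=10.0, gamma='scale', epsilon=0.01)
    model.fit(profiles, displacements)
    return model

# prediction for a new window: correction = model.predict(v.reshape(1, -1))[0]
```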
Cascaded prediction scheme. Although the use of SVR yields robust prediction results for large parameter perturbations, the prediction accuracy with small parameter displacements is still not satisfactory. The use of an iterative approach [2] will not give much gain, since the prediction error over different parameter displacements mostly remains the same. However, if a second prediction function is applied with smaller capture range but higher accuracy, then the error of the final prediction can be significantly reduced. To this end, we propose a cascaded prediction approach in which a set of SVR functions are trained over varying data perturbation ranges. These cascaded functions form a prediction chain. The initial functions in the chain are trained with large parameter displacements but only have coarse prediction accuracy. On the other hand, the succeeding functions are trained with smaller parameter displacements but have approximately double accuracy. With this prediction chain, the incorrectly displaced model parameters can be gradually ‘pulled’ to the correct position. In practice, the prediction chain only contains a few SVR functions. In our case, three SVR functions are used (more does not improve), each of which is trained over a training set by halving the perturbation range of the previous one. Fig. 4 shows the prediction performance of the second and third functions for horizontal prediction. In Section 4, we provide experimental results that demonstrate the effectiveness of the cascaded prediction scheme.
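A minimal sketch of chaining the predictors is given below; the per-stage predictor lists and the profile-extraction callback are assumptions about how the trained SVRs would be organized.

```python
import numpy as np

def cascaded_prediction(pose, stages, extract_profile):
    """stages: list of prediction stages ordered from coarse (largest capture
    range) to fine; each stage is a list of per-parameter SVRs whose outputs
    together form the pose correction.
    extract_profile(pose) -> profile vector v for the window given 'pose'."""
    for stage in stages:
        v = extract_profile(pose).reshape(1, -1)
        correction = np.array([svr.predict(v)[0] for svr in stage])
        pose = pose + correction        # step-wise 'pull' towards the optimum
    return pose
```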
Fig. 4. The horizontal(x)-prediction performance trained over halved perturbation ranges (left: the 2nd function, right: the 3rd function).
3.3
Improving Accuracy Using Direct Local Optimization
The prediction results achieved in the previous section largely depend on the feature appearance in the training set. The use of prior knowledge leads to a fast and robust ‘jump’ to the right position. However, it is not well adapted to individual features. Therefore, we apply a general optimization technique to refine the matching result. The procedure minimizes the fitting cost function w.r.t. both the pose and the deformation parameters. Based on the prediction statistics in the previous section, the optimization needs only to perform a constrained search over a small parameter subspace. We have used the following optimization techniques in our experiments. 1) Gradient-descent: Although gradient-based algorithms are fast, they fail to yield satisfactory results in our case. Since the target function can have many local minima, the gradient-descent-based search can easily converge to local minima. Moreover, the computation of function derivatives is time-consuming. 2) Simulated annealing: Simulated annealing is a ‘global’ optimization technique, which makes use of random sampling in the parameter space. However, the tuning of the annealing parameters is difficult (e.g. the cooling rate and the sampling step). Preliminary experiments showed that the use of simulated annealing is much more computationally expensive than the Simplex method (illustrated below) and yields no better results. 3) Simplex algorithm: Although still a local optimization technique, the Simplex method allows occasional ‘jumps’ out of local minima. In our experiments, it gives the best trade-off between fitting accuracy and computation cost.
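A minimal sketch of the constrained Simplex refinement using SciPy's Nelder-Mead implementation; the cost function (the texture matching error) and the per-parameter search radii derived from the prediction statistics are placeholders.

```python
import numpy as np
from scipy.optimize import minimize

def refine_parameters(fitting_cost, params0, search_radius):
    """Refine pose + deformation parameters around the predicted values with
    the Nelder-Mead Simplex method. fitting_cost(params) returns the texture
    matching error; search_radius bounds the per-parameter search range."""
    bounds = [(p - r, p + r) for p, r in zip(params0, search_radius)]
    result = minimize(fitting_cost, params0, method='Nelder-Mead',
                      bounds=bounds,            # bounds need SciPy >= 1.7
                      options={'xatol': 1e-3, 'fatol': 1e-4, 'maxiter': 500})
    return result.x
```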
4
Experimental Results
In this section, we give the experimental results for eye extraction, using the same data set as given in Section 3.2. Pose parameter prediction. To measure the robustness and accuracy of the parameter prediction, we randomly perturb the pose parameters in the test set within the range specified in Table 1. The predicted parameters are compared with the ground-truth parameters, and the results are given in Table 2. It can be seen that the cascaded SVR prediction generally yields higher prediction accuracy.
Feature extraction accuracy. To measure the feature extraction accuracy, we randomly position a mean shape near the ground-truth position in the test image and perform the model fitting. The average point-to-point error between the fitted shape and the manually labelled shape is measured (see Table 3). It can be seen that the use of the Simplex optimization effectively improves the extraction accuracy. Fig. 5 gives two examples of the eye extraction. A typical
Fig. 5. Stages of the eye extraction for the complete algorithm.
execution takes approximately 40-60 ms on a Pentium-IV PC (3 GHz), in which the SVR prediction takes one-third and the Simplex optimization takes two-thirds of the total execution time. This is much more efficient than using a direct optimization alone, which takes 300-400 ms under the same conditions.
5
Conclusions
In this paper, we have proposed a fast facial feature extraction technique for face recognition applications. The proposal contains three major contributions. First, we use support vector regression to train a parameter predictor for the feature model (Section 2), which is used to estimate the correct parameter displacements in a single step. Second, we use a cascade of SVR-based predictors with increasing convergence accuracy. The predictors are trained over data sampled with varying perturbation ranges, to give a performance that exchanges capture range with prediction accuracy. The cascading of these predictors thus combines a large capture range with a high prediction accuracy. Finally, a direct individual image optimization by the Simplex algorithm gives improved model parameters. The experimental results show at least a doubled convergence area compared to AAM, with higher accuracy. We are now applying the technique to a larger-scale database and inserting it into an embedded/consumer face-recognition application.
References 1. Yuille, A., Cohen, D., and Hallinan, P.: Feature extraction from faces using deformable templates. Proc. CVPR. (1989) pp. 104–109 2. Cootes, T., Taylor, C.: Statistical models of appearance for computer vision. Tech. Rep. ISBE, Univ. Manchester. (2001) 3. Smola, A., Schölkopf, B.: A tutorial on support vector regression. Tech. Rep. NCTR-98-030, Univ. London. (1998) 4. Chang, C. C., Lin C. J.: LIBSVM: a library for support vector machines. http://www.csie.ntu.edu.tw/~cjlin/libsvm. (2001) 5. Cheung, K-W., Yeung, D-Y, and Chin, R.: On deformable models for visual pattern recognition. Pattern Recog. 35 (2002) pp. 1507–1526 6. Duda, R., Hart, P., and Stork, D.: Pattern classification. (2001) 7. Stegmaan, M.: Analysis and segmentation of face images using point annotation and linear subspace techniques. Tech. Rep. DTU. (2002) 8. Nelder, J.A., and Mead, R.: A Simplex Method for Function Minimization. Computer Journal, vol. 7. (1965) pp. 308–313
Frontal Face Authentication Through Creaseness-Driven Gabor Jets Daniel González-Jiménez and José Luis Alba-Castro Departamento de Teoría de la Señal y Comunicaciones, Universidad de Vigo
Abstract. One of the most successful techniques for frontal face authentication is based on matching image graphs that represent faces. The local identity information is stored at graph node locations through vectors of multi-scale and multi-orientation Gabor image responses. In this work we present an efficient way to process frontal faces by automatically locating graph nodes at selected crease points. This method does not rely on rigorous face alignment, increases the speed of the pre-processing step, and results in a face-graph representation that obtains results comparable to other methods reported in the literature on the XM2VTS database.
1 Introduction

Automatic personal identity authentication based on face images has been a successful field of research, particularly during recent years. Despite the extraordinary advances introduced in the field during this period, some of the originally stated problems involved in recognizing faces are still not properly solved. Some of the still open problems have become apparent as a consequence of the use of systematic protocols and database benchmarks that have been applied to measure advances in the research [1] and industrial [2] communities. Most face recognition systems rely on a compact representation that encodes global and/or local information. Global approaches have given rise to a family of linear projection methodologies (among others: Principal Component Analysis (eigenfaces [3]), Linear Discriminant Analysis (Fisherfaces [4]), Independent Component Analysis [5], Non-Negative Matrix Factorization [6], etc.). These methods are devoted to encoding faces in an efficient manner and to characterizing the spanned face space or manifold. In this way, the coordinates of face images inside these manifolds allow similarity between faces to be encoded and measured (face recognition) in a compact and efficient manner. Local approaches have been based on finding and characterizing informative features such as eyes, nose, mouth, chin, eyebrows, etc. If their mutual relationships are considered, then we have a local-global approach (Local Feature Analysis [7]). Within this last group of methods we find Elastic Bunch Graph Matching (EBGM) [8]. In EBGM a sparse grid is overlaid on the face image during the training phase and its nodes are "adjusted" to a set of fiducial points. Every time an image of the same subject is shown to the system, the grid-nodes have to be automatically "moved" to their corresponding fiducial points. The convolution of a set of 2-D Gabor wavelets is computed at every grid-node and the output represents a local feature vector for that particular fiducial point of the face. Then, the global representation comes through the
mutual distances between grid-nodes for a particular face. One of the most critical parts of EBGM, both in terms of accuracy and in terms of computational burden, is the location of grid-nodes at fiducial points. It consists of the computation of a phase-sensitive similarity function [8] over a range of positions and sizes from an initial rough estimate of the fiducial point location. In this paper we present an easier way to locate grid-nodes by taking advantage of illumination-invariant features from which the geometry of the face can be characterized. We have chosen the image ridges and valleys as shape feature descriptor [9]. This choice is motivated by two main reasons, namely their robustness to illumination changes and their capability of capturing the face geometry, in agreement with some evidence from cognitive science. The rest of the paper is organized as follows: Section 2 describes the ridges and valleys operator that is the basis of our method. Section 3 describes the method itself and the practical considerations we took into account. Section 4 is dedicated to the experimental results on the XM2VTS database, and Section 5 gives some conclusions and future research lines.
2 Ridges and Valleys Operator

Geometric descriptors are used as relevant low-level image features, with edge-based descriptors being the most widely accepted. A second-derivative approach called "ridges and valleys" plays an important role in many applications. In the literature we can find different mathematical characterizations that try to formalize the intuitive notion of creaseness (ridge/valley). In the context of face recognition, we have used the valleys obtained by thresholding the so-called multilocal level set extrinsic curvature (MLSEC) [9, 10]. We have chosen the MLSEC due to its invariance to both rigid image motions and monotonic grey-level changes and, mainly, because of its high continuity and meaningful dynamic range, in contrast to other measures with the same invariance. Basically, the valleys based on the MLSEC are delineated by:
- computing the normalized gradient vector field of the smoothed image (usually using Gaussian smoothing);
- calculating the divergence of this vector field, which gives rise to a bounded and well-behaved measure of valleyness (positive values running from 0 to 2 in 2D) and ridgeness (negative values from -2 to 0);
- thresholding the response of the above operator, so that image pixels where the MLSEC response is smaller than -1 are considered ridges, and those pixels where it is larger than 1 are considered valleys.
Besides its desirable illumination-invariant behaviour, the relevance of valleys in face shape description has been pointed out by some cognitive science works [11]. Among others, Pearson et al. hypothesize that this kind of filter is used as an early step by the human visual system (HVS). They base their assertions on human behaviour when recognizing faces, which can be summarized as: i) valley positions give the observer reliable 3D information about the shape of the object being observed (valleys of a 3D surface with uniform albedo are placed at those points where the surface normal is perpendicular to the point-observer axis); ii) the response of a valley detector depicts the face in a similar way to how a human would draw it, showing the position and extent of the main face features; iii) the ability of the HVS
when recognizing faces decreases dramatically if negative images are used instead of positive ones. If the HVS were tuned to edges instead of valleys, it would recognize an image and its negative equally well, because edges in both images are placed at the same positions; valley responses, on the other hand, do not remain at the same positions when the image is negated (valleys become ridges). Once the feature descriptor has been properly defined, we have a way of describing fiducial points in terms of positions where the geometrical image features have been detected. In order to use this shape descriptor for face recognition or authentication, local textural information must also be taken into account [12]. Gabor wavelets are biologically motivated convolution kernels that capture local texture information and are also quite invariant to the local mean brightness, so an efficient face encoding approach is to extract textural information from geometrically salient regions.
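As a rough illustration of the three-step MLSEC procedure above, the following NumPy/SciPy sketch smooths the image, normalizes its gradient field, takes the divergence and applies the ±1 thresholds; the smoothing scale and the numerical details are our own assumptions, not those of [9, 10].

```python
# Minimal MLSEC-style valley/ridge extraction sketch.
import numpy as np
from scipy.ndimage import gaussian_filter

def mlsec_sketch(image, sigma=2.0, eps=1e-8):
    smoothed = gaussian_filter(image.astype(float), sigma)
    gy, gx = np.gradient(smoothed)                 # image gradient
    norm = np.sqrt(gx**2 + gy**2) + eps
    nx, ny = gx / norm, gy / norm                  # normalized gradient vector field
    # Divergence of the normalized field: bounded creaseness measure in [-2, 2].
    div = np.gradient(nx, axis=1) + np.gradient(ny, axis=0)
    valleys = div > 1.0                            # MLSEC response larger than 1
    ridges = div < -1.0                            # MLSEC response smaller than -1
    return div, ridges, valleys
```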
3 Creaseness-Driven Gabor Jets

3.1 Location of Fiducial Points

Once the ridges and valleys in a new image have been extracted, we must compute Gabor responses at some of those points. There are several possible combinations, in terms of using just ridges, just valleys or both of them, so from now on we will refer to the binary image obtained as a result of the previous processing as the sketch. In order to keep a reasonable number of fiducial points, a rectangular grid is applied onto the sketch, and each node changes its position until it finds the closest line in the sketch. We start with a dense rectangular grid, but in order to avoid overlap between filter responses and to reduce computational time, we keep only a subset of its nodes. We therefore establish a minimum distance D between each pair of nodes, so that all final positions are separated by at least D. We extract textural information centred at those final points. Matching between pairs of nodes from the two images is also necessary, in order to compare Gabor responses from the same positions. Figure 1 shows the original rectangular grid (left) and a set of randomly selected points (right); both will be used as baseline comparisons. Figures 2 and 3 show how the original grid is adjusted to the sketch (a code sketch of this adjustment is given after Fig. 3). It is important to highlight that we assume faces have been previously located by some face detection algorithm and roughly scaled, but we do not assume a rigorous face alignment giving the same eye and/or mouth coordinates. This is a very important remark because many face recognition algorithms rely on geometrically normalized faces before encoding them.
Fig. 1. Left: regular rectangular grid placed on the face; right: position of fiducial points randomly selected (both of them will be used as baseline tests).
Fig. 2. Left: Binary image showing the valleys of the original image; right: final positions of grid-nodes, adjusted to the valleys
Fig. 3. Left: Binary image showing the ridges of the original image; right: Final positions of grid-nodes, adjusted to the ridges
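The node-adjustment procedure of Sect. 3.1 can be sketched as follows; the snapping rule and the way the minimum distance D is enforced are illustrative assumptions, not the authors' exact implementation.

```python
# Illustrative sketch: snap grid nodes to the nearest crease pixel in the binary
# sketch, discarding nodes closer than min_dist to an already accepted node.
import numpy as np

def adjust_grid_to_sketch(sketch, grid_nodes, min_dist):
    crease = np.argwhere(sketch > 0)                       # (row, col) crease pixels
    accepted = []
    for node in grid_nodes:
        d = np.linalg.norm(crease - node, axis=1)
        snapped = crease[np.argmin(d)]                     # nearest crease point
        if all(np.linalg.norm(snapped - a) >= min_dist for a in accepted):
            accepted.append(snapped)
    return np.array(accepted)

# Example (assumed sizes): a grid over a 128x128 sketch, nodes at least 12 pixels apart.
# ys, xs = np.meshgrid(np.arange(6, 128, 13), np.arange(6, 128, 13))
# nodes = adjust_grid_to_sketch(sketch, np.stack([ys.ravel(), xs.ravel()], axis=1), 12)
```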
Gabor Filters

This system uses a set of 40 Gabor filters (5 frequencies and 8 orientations) in order to obtain information from face images. These filters are convolution kernels in the shape of plane waves restricted by a Gaussian envelope, as shown next:

$$\psi_j(\vec{x}) = \frac{k_j^2}{\sigma^2}\exp\left(-\frac{k_j^2 x^2}{2\sigma^2}\right)\left[\exp\left(i\,\vec{k}_j\cdot\vec{x}\right)-\exp\left(-\frac{\sigma^2}{2}\right)\right]$$

where $\vec{k}_j$ contains the information about scale and orientation, and the same standard deviation $\sigma$ is used in both directions for the Gaussian envelope.
The region surrounding a pixel in the image is encoded by convolution with these filters, and the set of responses is called a jet. So, a jet is a vector with 40 coefficients, and it provides information about a specific region of the image. Each coefficient can be expressed as follows:

$$J_j(\vec{x}_0) = \int I(\vec{x})\,\psi_j(\vec{x}_0-\vec{x})\,d^2x$$
The jets are computed over the input image I(x), which represents the original image normalized in illumination. This information is stored for further comparison. Given two images A and B, we will refer to the sets of jets computed for these images as $J^A$ and $J^B$, respectively. The similarity between $J^A$ and $J^B$ is given as:

$$S\left(J^A, J^B\right) = \frac{1}{N}\sum_{k=1}^{N}\left\langle J^A_k, J^B_k\right\rangle$$

where $\langle J^A_k, J^B_k\rangle$ represents the normalized dot product between the k-th jet from $J^A$ and the k-th jet from $J^B$, taking into account that only the moduli of the jet coefficients are used. N stands for the number of jets per image. In the experimental results shown in the next section, every face is encoded as the coordinates of the N grid nodes together with their 40-dimensional Gabor jet local responses.
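The jet computation and the modulus-based similarity can be sketched as follows; the kernel parameterization, patch size and the frequency/orientation values in the final comment are assumptions for illustration only, not the paper's exact settings.

```python
# Illustrative sketch: 40-coefficient Gabor jets at grid nodes and jet-set similarity.
import numpy as np

def gabor_kernel(freq, theta, sigma, size=31):
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
    k = 2 * np.pi * freq
    kx, ky = k * np.cos(theta), k * np.sin(theta)
    envelope = (k**2 / sigma**2) * np.exp(-(k**2) * (x**2 + y**2) / (2 * sigma**2))
    return envelope * (np.exp(1j * (kx * x + ky * y)) - np.exp(-sigma**2 / 2))

def jet_at(image, node, kernels):
    # Assumes the node is far enough from the image border for a full patch.
    r, c = node
    half = kernels[0].shape[0] // 2
    patch = image[r - half:r + half + 1, c - half:c + half + 1]
    return np.array([np.sum(patch * k) for k in kernels])   # 40 complex coefficients

def jet_similarity(jets_a, jets_b):
    sims = []
    for a, b in zip(jets_a, jets_b):
        ma, mb = np.abs(a), np.abs(b)                        # only moduli are used
        sims.append(np.dot(ma, mb) / (np.linalg.norm(ma) * np.linalg.norm(mb)))
    return float(np.mean(sims))

# Assumed filter bank: 5 frequencies x 8 orientations.
# kernels = [gabor_kernel(f, o, sigma=2 * np.pi)
#            for f in (0.25, 0.177, 0.125, 0.088, 0.0625)
#            for o in np.arange(8) * np.pi / 8]
```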
4 Experimental Results

The proposed method has been tested for face authentication on the XM2VTS database [1]. The database contains 295 persons with eight image shots each (2 shots x 4 sessions) captured under controlled conditions (uniform background, roughly the same distance to the camera and no expression changes). There are, though, some lighting and personal appearance changes. Any biometric authentication system relies on the 1:1 comparison between the stored model of a claimed identity and the biometric data captured from the claimer. In this context the following definitions apply:
- client (impostor): a person that has (has not) a stored biometric model in the system;
- access threshold: similarity value that thresholds the null hypothesis "the claimer is the client he claims to be". The access threshold can be the same for all clients or adapted to each one;
- FAR (False Acceptance Rate): number of accepted impostors / number of impostor claimants;
- FRR (False Rejection Rate): number of rejected clients / number of client claimants;
- EER (Equal Error Rate): operating point where both rates are equal.
Experiments performed in this work have been carried out following the Lausanne protocol. Configuration I of that protocol is summarized here:
- Training: 200 clients, 3 images/client (1 shot from sessions 1, 2 and 3);
- Evaluation: 200 clients, 3 images/client (the other shot from sessions 1, 2 and 3); 25 impostors, 8 images/impostor;
- Test: 200 clients, 2 images/client (2 shots from session 4); 70 impostors (different from evaluation), 8 images/impostor.
The training phase is simply the storing of the 200x3 image graphs. In the evaluation phase we need to adjust the access threshold(s) to satisfy a specified relationship between FAR and FRR. There are three important thresholds that define the performance of an authentication system: {th | FAR_ev = 0}, {th | FRR_ev = 0}, and {th | FAR_ev = FRR_ev} (the evaluation EER threshold). The ROC (Receiver Operating Characteristic) curve describes the performance of the system between the more restrictive case (FAR low) and the more permissive case (FAR high). We have tested our algorithm with different grid-node location methods: i) no restrictions at all (uniformly randomized locations), ii) a regular rectangular grid centred on the face image, iii) ridges-driven location, iv) valleys-driven location, and v) additive fusion of ridges and valleys similarity values. We have also tested ridges&valleys-driven location, but its performance is poorer than that of the other three methods because too many feature creases cause more randomized grid-node locations. Figures 4 and 5 show the performance of the system for the evaluation set and for the test set using a common access threshold equal to the average of the individual EER thresholds on the evaluation set, and checking FRR and FAR over a symmetric interval.
Fig. 4. ROC curve for the evaluation set. Threshold values on the EER line will also be used as access thresholds for the test set.
Although ROC curves for the test set illustrate the performance of the system, in a real scenario we have to set the threshold a priori and simply measure the FAR and FRR, as we have reported. Table 1 shows the FRR and FAR values obtained with the access thresholds derived from the evaluation set for the five different location methods. This table also shows some results from the best methods found in the literature [13].
Fig. 5. ROC curve for the test set using an interval over the access thresholds obtained from the evaluation set. The FRR and FAR values obtained for the access thresholds are marked over their corresponding curves.
From the results in Table 1 we can conclude that locating fiducial points using only ridge information yields more discriminative textural information than using valleys or even a fusion of both similarity results. As we could expect, the baseline comparison against a regular grid or a randomly generated grid confirms that our approach captures discriminative information useful for face authentication. Regarding the comparison to other methods in the literature (5 from the last AVBPA competition [13]), our approach works better than 2 of them and worse than the other 3. In any case, our method is still not fine-tuned. The selection of thresholds (and the use of individual thresholds) can be improved, and the location of fiducial points can also be adjusted by cleaning up the binary image and controlling the adaptation of grid-nodes.
5 Conclusions and Future Research

In this work we have presented a technique for automatically locating grid-nodes at selected ridge and/or valley points and collecting textural information by convolving a set of Gabor filters over those face regions. This method does not need rigorous face alignment, and results in a face-graph representation that obtains results comparable to other methods reported in the literature on the XM2VTS database. We now need to adapt this method to handle face images with larger scale variation and slight in-plane and in-depth rotations. We are starting to test it on the more realistic BANCA database.
References
1. The extended XM2VTS database, http://www.ee.surrey.ac.uk/Research/VSSP/xm2vtsdb/, and the BANCA database, http://www.ee.surrey.ac.uk/Research/VSSP/banca/
2. The BioID database, http://www.bioid.com
3. B. Moghaddam and A. Pentland, "Probabilistic Visual Learning for Object Representation," IEEE Trans. on PAMI, 19(7), 696-710, 1997
4. P.N. Belhumeur, J.P. Hespanha and D.J. Kriegman, "Eigenfaces vs. Fisherfaces: Recognition using class-specific linear projection," IEEE Trans. on PAMI, 19(7), 711-720, 1997
5. M.S. Bartlett and T.J. Sejnowski, "Viewpoint invariant face recognition using independent component analysis and attractor networks," NIPS, M. Mozer et al., Editors, MIT Press, 1997
6. D. Guillamet and J. Vitria, "Introducing a weighted non-negative matrix factorization for image classification," Pattern Recognition Letters, 24:2447-2454, 2003
7. P. Penev and J. Atick, "Local feature analysis: a general statistical theory for object representation," Network: Computation in Neural Systems 7 (August 1996) 477-500
8. L. Wiskott, J.M. Fellous, N. Kruger and C. von der Malsburg, "Face recognition by Elastic Bunch Graph Matching," IEEE Trans. on PAMI, 19(7), 775-779, 1997
9. A.M. López, F. Lumbreras, J. Serrat and J.J. Villanueva, "Evaluation of Methods for Ridge and Valley Detection," IEEE Trans. on PAMI, 21(4), 327-335, 1999
10. A.M. López, D. Lloret, J. Serrat and J.J. Villanueva, "Multilocal Creaseness Based on the Level-Set Extrinsic Curvature," CVIU, Academic Press, 77(2):111-144, 2000
11. D.E. Pearson, E. Hanna and K. Martinez, "Computer-generated cartoons," Images and Understanding, 46-60, Cambridge University Press, 1990
12. Jose L. Alba, Albert Pujol and J.J. Villanueva, "Separating geometry from texture to improve face analysis," in Proc. IEEE ICIP, 673-676, Barcelona (Spain), 2001
13. J. Kittler et al., "Face verification competition on the XM2VTS database," IEEE International Conference on Audio- and Video-Based Person Authentication, LNCS 2688, pp. 964-974, 2003
A Coarse-to-Fine Classification Scheme for Facial Expression Recognition
Xiaoyi Feng 1,2, Abdenour Hadid 1, and Matti Pietikäinen 1
1 Machine Vision Group, Infotech Oulu and Dept. of Electrical and Information Engineering, P.O. Box 4500, FIN-90014 University of Oulu, Finland
{xiaoyi,hadid,mkp}@ee.oulu.fi
2 College of Electronics and Information, Northwestern Polytechnic University, 710072 Xi'an, China
[email protected]
Abstract. In this paper, a coarse-to-fine classification scheme is used to recognize the facial expressions (angry, disgust, fear, happiness, neutral, sadness and surprise) of novel expressers from static images. In the coarse stage, the seven-class problem is reduced to a two-class one as follows: first, seven model vectors are produced, corresponding to the seven basic facial expressions. Then, distances from each model vector to the feature vector of a testing sample are calculated. Finally, two of the seven basic expression classes are selected as the testing sample's expression candidates (candidate pair). In the fine classification stage, a K-nearest neighbor classifier performs the final classification. Experimental results on the JAFFE database demonstrate an average recognition rate of 77% for novel expressers, which outperforms the reported results on the same database.
1 Introduction

Numerous algorithms for facial expression analysis from static images have been proposed [1,2,3] and the Japanese Female Facial Expression (JAFFE) Database is one of the common databases for testing these methods [4-10]. Lyons et al. provided a template-based method for expression recognition [4]. Input images were convolved with Gabor filters of five spatial frequencies. Then the amplitudes of the complex-valued filter responses were sampled at 34 manually selected fiducial points and combined into a single vector containing 1020 elements. Principal component analysis (PCA) was used to reduce the dimensionality of the data, and finally a simple LDA-based classification scheme was used. Zhang et al. [5,6] used a similar face representation while applying wavelets of 3 scales and 6 orientations. They also considered the geometric positions of the 34 fiducial points as features and used a multi-layer perceptron for recognition. Guo and Dyer [7] also adopted a similar face representation and used a linear programming technique to carry out simultaneous feature selection and classifier training. Buciu et al. [8]
adopted ICA and Gabor representations for facial expression recognition. Neural networks have been considered in [9, 10]. Recognizing the expressions of a novel individual is still a challenging task and only a few works have addressed this issue [1, 4, 10]. In this paper, a modified template-based classification method is proposed for expression recognition of novel expressers. Template-based techniques are simple face representation and classification methods. They have only limited recognition capabilities, which may be caused by the smoothing of some important individual facial details, by small misalignments of the faces, and also by large inter-personal expression differences, but they can discriminate typical and common features. In our work, a coarse-to-fine classification method is adopted, aiming to make use of the advantages of template-based methods and at the same time to weaken their shortcomings mentioned above. In the coarse classification stage, seven model vectors (templates) are formed for the seven basic facial expressions. Then distances between each template and a testing sample are calculated with the Chi-square statistic. The two nearest expression classes (candidate pair) are selected as candidate expressions. As a result, the seven-class classification is reduced to a two-class classification. Since traditional template-based methods have the ability to discriminate main facial expression features, the real expression class of the testing sample has a high probability of belonging to one of the two candidate expressions. To minimize the disadvantages of traditional template-based methods, the seven templates are replaced by multi-template pairs, and a weighted Chi-square statistic replaces the former Chi-square statistic as the dissimilarity measure in the fine classification stage. A simple K-nearest neighbor classifier then performs the final classification of the testing sample. The rest of the paper is organized as follows: face representation is introduced in Section 2. In Section 3, the coarse-to-fine expression classification method is presented. Experimental results are described in Section 4. Finally, we conclude the paper.
2 Face Representation

Fig. 1 illustrates the basic LBP operator [11]. The 3×3 neighborhood is thresholded by the value of the center pixel, and a binary pattern code is produced. The LBP code of the center pixel is obtained by converting the binary code into a decimal one. Based on this operator, each pixel of an image is labeled with an LBP code. The 256-bin histogram of the labels contains the density of each label over a local region, and can be used as a texture descriptor of the region. Recently, an LBP-based facial representation has shown outstanding results in face recognition [12]. In our work, we use a facial representation similar to that proposed in [12]:
- Divide the face image into small regions. The size of each pre-processed image is 150×128. After experimenting with different block sizes, we choose to divide the image into 80 (10×8) non-overlapping blocks of 15×16 pixels (see Fig. 2).
- Calculate the LBP histogram from each region. The LBP histogram of each region is obtained by scanning it with the LBP operator.
Fig. 1. The basic LBP operator
- Concatenate the LBP feature histograms into a single feature vector. The LBP histograms of all regions are combined to form a single feature vector representing the whole image.
Fig. 2. An example of a facial image divided into 10×8 blocks
The idea behind using this approach for feature extraction is motivated by the fact that emotion is most often communicated by facial movement, which changes the visible appearance. Our feature extraction method is capable of representing facial appearance and so it can be used for representing facial expressions.
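A minimal sketch of this representation, assuming our own implementation details (neighbour ordering, histogram normalization), is given below; only the 3×3 LBP operator, the 150×128 image size and the 10×8 block layout follow the text.

```python
# Illustrative sketch of the basic LBP operator and the block-wise histogram feature.
import numpy as np

def lbp_image(gray):
    """Label every interior pixel with its 8-bit LBP code."""
    g = gray.astype(int)
    center = g[1:-1, 1:-1]
    # Offsets of the 8 neighbours, in a fixed (assumed) order around the center pixel.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]
    codes = np.zeros_like(center)
    for bit, (dr, dc) in enumerate(offsets):
        neighbour = g[1 + dr:g.shape[0] - 1 + dr, 1 + dc:g.shape[1] - 1 + dc]
        codes += (neighbour >= center).astype(int) << bit
    return codes

def lbp_feature_vector(gray, blocks=(10, 8)):
    """Concatenate the 256-bin LBP histograms of non-overlapping blocks."""
    codes = lbp_image(gray)
    hists = []
    for row in np.array_split(codes, blocks[0], axis=0):
        for block in np.array_split(row, blocks[1], axis=1):
            h, _ = np.histogram(block, bins=256, range=(0, 256))
            hists.append(h / max(h.sum(), 1))          # normalized histogram
    return np.concatenate(hists)                        # 10 x 8 blocks x 256 bins
```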
3 Coarse-to-Fine Classification

Though traditional template-based approaches have only limited recognition capabilities, they are quite simple and can reflect main and common features. Experiments have shown that they are effective in recognizing intense and typical expressions. Based on that, they are used in our coarse classification procedure to reduce the seven-class problem to a two-class classification problem. To overcome their shortcomings, multi-template pairs and a K-nearest neighbor classifier are used in the fine classification.
3.1 Coarse Classification

In this stage, the classification is performed using a two-nearest-neighbor classifier with the Chi-square statistic as dissimilarity measure. Feature vectors of the training samples belonging to the same expression class are averaged to form model vectors, so that seven model vectors are constructed. The testing vector is extracted from a testing sample, and the distances from each model vector to the testing vector are calculated.

Consider a training set $X$ containing $n$ $d$-dimensional feature vectors. The training set is divided into seven subsets, and each subset corresponds to one expression. Let $X_c$ denote the subset with $n_c$ vectors $(c = 1, 2, \ldots, 7)$ and let $x^c_i$ be its $i$-th feature vector, so that $X = X_1 \cup X_2 \cup \ldots \cup X_7$. The model vector (denoted as $m_c$) of the $c$-th expression class is the cluster center of the $X_c$ subset:

$$m_c = \frac{1}{n_c}\sum_{i=1}^{n_c} x^c_i \qquad (1)$$

A Chi-square ($\chi^2$) statistic is used as dissimilarity measure between a testing sample and the models. Suppose $s$ is the test vector and $s_j$ is its $j$-th element; we have

$$\chi^2(s, m_c) = \sum_{j}\frac{(s_j - m_{c,j})^2}{s_j + m_{c,j}} \qquad (2)$$

The weighted Chi-square statistic [12] is defined as follows and will be used in our fine classification later:

$$\chi^2_w(s, m) = \sum_{j} w_j\,\frac{(s_j - m_j)^2}{s_j + m_j} \qquad (3)$$
Instead of classifying the test sample into one expression class, we choose two expression classes as candidates cc = {c1, c2}, where c1 and c2 are the two classes whose model vectors have the smallest Chi-square distances to the test vector.
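A minimal sketch of the coarse stage, under the assumption that model vectors and feature vectors are plain NumPy histograms, is given below; the small epsilon guarding against division by zero is our own addition.

```python
# Illustrative sketch: chi-square distances to the seven model vectors and
# selection of the two nearest expression classes as the candidate pair.
import numpy as np

def chi_square(s, m, eps=1e-10):
    return np.sum((s - m) ** 2 / (s + m + eps))

def coarse_candidates(test_vector, model_vectors):
    """model_vectors: dict mapping expression label -> averaged feature vector."""
    dists = {label: chi_square(test_vector, m) for label, m in model_vectors.items()}
    ranked = sorted(dists, key=dists.get)
    return ranked[0], ranked[1]          # candidate pair (c1, c2)

# models = {label: np.mean(np.stack(vecs), axis=0) for label, vecs in training_subsets.items()}
# c1, c2 = coarse_candidates(test_vec, models)
```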
3.2 Fine Classification

To overcome the shortcomings of traditional template-based techniques, multi-template pairs are used in the fine classification stage, replacing the seven simple templates. A simple K-nearest neighbor classifier is also used in this stage. Our experimental results support these fine classification ideas: when we analyzed the results of the coarse classification, we noticed that more than 50% of the wrongly recognized testing samples had the second-nearest expression class as their real expression class. This shows that the template-based method has the ability to discriminate expressions at a coarse level, and that other methods are needed to discriminate expressions at a fine level. The following steps are used in fine classification. First, multi-template pairs are formed for each pair of candidate expressions; each template pair corresponds to one expresser in the training set. The multi-template pairs are formed as follows.

In the case where one of the expression candidates is neutral: suppose the other expression is c. For each expresser in the training set, the distances between each feature vector of expression c and each feature vector of neutral are calculated by formula (3). The template pair with the smallest distance is selected as one template pair for the neutral-c classification. The above procedure is repeated for all expressers in the training set. Regions containing more useful information for expression classification are given a relatively high weight value. The aim of forming template pairs in this way is to minimize the distance within each pair, so as to ensure that expressions with weak intensity are classified correctly.

In the case where neither of the expression candidates is neutral: denote the two expression candidates as c1 and c2. For each expresser in the training set, one vector corresponds to the center of the feature vectors of expression c1 and another to the center of the feature vectors of expression c2; this vector pair forms one template pair for the c1-c2 classification. The above procedure is repeated for all expressers in the training set, so the number of template pairs equals the number of expressers in the training set.

Once the multi-template pairs are formed for one candidate pair, the weighted Chi-square statistic is used as dissimilarity measure. Since more than one template pair is employed for each candidate pair, we use a simple K-nearest neighbor classifier for the two-class classification in this stage.
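The fine stage can be sketched as follows; the template pairs, the weight vector and the value of K are assumed to be supplied by the caller, and the majority vote is our own reading of the K-nearest-neighbor step.

```python
# Illustrative sketch: weighted chi-square distances to every template of the
# candidate pair, followed by a K-nearest-neighbor vote between the two classes.
import numpy as np

def weighted_chi_square(s, m, w, eps=1e-10):
    return np.sum(w * (s - m) ** 2 / (s + m + eps))

def fine_classify(test_vector, template_pairs, weights, k=3):
    """template_pairs: list of (template_c1, template_c2), one pair per expresser."""
    scored = []
    for t1, t2 in template_pairs:
        scored.append((weighted_chi_square(test_vector, t1, weights), 0))  # class c1
        scored.append((weighted_chi_square(test_vector, t2, weights), 1))  # class c2
    scored.sort(key=lambda item: item[0])
    votes = [label for _, label in scored[:k]]
    return max(set(votes), key=votes.count)    # 0 -> candidate c1, 1 -> candidate c2
```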
4 Experiments and Results

Our method is tested on the Japanese Female Facial Expression (JAFFE) database [13]. The database contains 213 images of ten expressers, each posing 3 or 4 examples of each of the seven basic expressions (happiness, sadness, surprise, anger, disgust, fear, neutral). Sample images from the database are shown in Fig. 3.
Fig. 3. Samples from the Japanese Female Facial Expression Database
There are mainly two ways to divide the JAFFE database. The first is to divide the database randomly into 10 roughly equal-sized segments, of which nine segments are used for training and the last one for testing. The second is to divide the database into several segments, with each segment corresponding to one expresser. In our experiments, image pre-processing is conducted by the pre-processing subsystem of the CSU Face Identification Evaluation System [14]. As a result, the size of each pre-processed image is 150×128 (see Fig. 4).
Fig. 4. Samples from the preprocessed images
To compare our results to those of other methods, a set of 193 expression images posed by nine expressers is used. These images are partitioned into nine segments, each corresponding to one expresser. Eight of the nine segments are used for training and the ninth for testing. The above process is repeated so that each of the nine partitions is used once as the test set. The average recognition rate for the expressions of novel expressers is 77% (the recognition results of each trial are given in Table 1). We now compare the recognition performance to other published methods using the same database. In [4], a result of 75% using Linear Discriminant Analysis (LDA) was reported with 193 images. In [10], an average recognition result of 30% was reported with 213 images.
Other reports [5-9] on the same database did not give the recognition rate for expressions of novel expressers. It should be pointed out that in [4], 34 fiducial points have to be selected manually. In our method, we need only the positions of the two pupils for face normalization, and all other procedures are completely automatic. It should also be noted that in the JAFFE database some expressions have been labeled incorrectly or expressed inaccurately. Whether these expression images are used for training or testing, the recognition result is affected. Fig. 5 shows a few examples with the labeled expressions and our recognition results.
Fig. 5. Examples of disagreement. From left, the labeled expressions are sadness, sadness, sadness, surprise, fear, disgust, happiness, and the recognition results are happiness, neutral, neutral, happiness, sadness, angry and neutral, respectively
5 Conclusion

How to recognize the facial expressions of a novel expresser from static images is one of the challenging tasks in facial expression recognition. Template-based techniques can reflect main and typical features, but they smooth out some important individual features. A coarse-to-fine classification scheme is used so that the classification can utilize the advantages of template-based techniques and minimize their disadvantages. The combination of multi-template pairs, the weighted Chi-square statistic and a K-nearest neighbor classifier provides a good solution. Experimental results demonstrate that our method performs better than other methods on the JAFFE database.

Acknowledgement. The authors thank Dr. M. Lyons for providing the Japanese Female Facial Expression (JAFFE) Database. The authors also thank CIMO of Finland and the China Scholarship Council for their financial support of this research work.
References
1. M. Pantic, L.J.M. Rothkrantz: Automatic analysis of facial expressions: the state of the art. IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 22 (2000) 1424-1445
2. B. Fasel, J. Luettin: Automatic facial expression analysis: A survey. Pattern Recognition, Vol. 36 (2003) 259-275
3. W. Fellenz, J. Taylor, N. Tsapatsoulis, S. Kollias: Comparing template-based, feature-based and supervised classification of facial expression from static images. Computational Intelligence and Applications (1999)
4. M. Lyons, J. Budynek, S. Akamatsu: Automatic classification of single facial images. IEEE Trans. Pattern Analysis and Machine Intelligence, Vol. 21 (1999) 1357-1362
5. Z. Zhang: Feature-based facial expression recognition: Sensitivity analysis and experiments with a multi-layer perceptron. Pattern Recognition and Artificial Intelligence, Vol. 13 (1999) 893-911
6. Z. Zhang, M. Lyons, M. Schuster, S. Akamatsu: Comparison between geometry-based and Gabor-wavelet-based facial expression recognition using multi-layer perceptron. In: Third International Conference on Automatic Face and Gesture Recognition (1998) 454-459
7. G.D. Guo, C.R. Dyer: Simultaneous feature selection and classifier training via linear programming: A case study for face expression recognition. In: IEEE Conference on Computer Vision and Pattern Recognition (2003) 346-352
8. I. Buciu, C. Kotropoulos, I. Pitas: ICA and Gabor representation for facial expression recognition. In: International Conference on Image Processing (2003) 855-858
9. B. Fasel: Head-pose invariant facial expression recognition using convolutional neural networks. In: Fourth IEEE Conference on Multimodal Interfaces (2002) 529-534
10. M. Gargesha, P. Kuchi: Facial expression recognition using artificial neural networks. EEE 511: Artificial Neural Computation Systems (2002)
11. T. Ojala, M. Pietikäinen, T. Mäenpää: Multiresolution gray-scale and rotation invariant texture classification with Local Binary Patterns. IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 24 (2002) 971-987
12. T. Ahonen, A. Hadid, M. Pietikäinen: Face recognition with local binary patterns. In: 8th European Conference on Computer Vision (2004) 469-481
13. M. Lyons, S. Akamatsu, M. Kamachi, J. Gyoba: Coding facial expressions with Gabor wavelets. In: Third IEEE Conference on Face and Gesture Recognition (1998) 200-205
14. D. Bolme, M. Teixeira, J. Beveridge, B. Draper: The CSU face identification evaluation system user's guide: its purpose, features and structure. In: Third International Conference on Computer Vision Systems (2003) 304-313
Fast Face Detection Using QuadTree Based Color Analysis and Support Vector Verification
Shu-Fai Wong and Kwan-Yee Kenneth Wong
Department of Computer Science and Information Systems, The University of Hong Kong, Hong Kong
{sfwong, kykwong}@csis.hku.hk
Abstract. Face detection has potential applications in a wide range of commercial products such as automatic face recognition systems. Commonly used face detection algorithms can extract faces from images accurately and reliably, but they often take a long time to finish the detection process. Recently, there has been an increasing demand for real-time face detection algorithms in applications such as video surveillance systems. This paper proposes a multi-scale face detection scheme using a Quadtree so that the time complexity of the face detection process can be reduced. Performing analysis from coarse to fine scales, the proposed scheme uses skin color as a heuristic feature and a support vector machine as a verification tool to detect faces. Experimental results show that the proposed scheme can detect faces in images reliably and quickly. Keywords: Object Detection, Pattern Recognition, Color Analysis
1 Introduction
Face detection is one of the hot topics in pattern recognition. It receives much attention mainly because of its wide range of applications. Intelligent video surveillance systems, reliable human-computer interfaces, and automatic formation of face databases are some examples of commercial and industrial applications using face detection as a major component. Commonly used face detection algorithms can detect faces accurately; a comprehensive survey on face detection can be found in [1], [2]. Most of these detection algorithms perform a brute-force search for face patterns in the image. In other words, pattern recognition has to be performed at different scales and positions within the image. Pattern recognition techniques such as neural networks [3], Gabor filters [4], and Markov random fields [5] have been used in face detection recently. The major problem with such algorithms is that they are time consuming and hence cannot be applied in a real-time system. To reduce the time complexity, researchers have started investigating the use of visual cues to facilitate the searching process. Motion, color, and the configuration of facial features have been used as visual cues. Among these cues, color is the most widely used [6], [7], [8]. Skin color can be learnt from a face database. By
performing an appropriate color-space transformation, the projection of skin colors onto such a color space will form a cluster. Potential face regions can be identified by comparing the distance between the projection of a pixel's intensity and the center of the cluster in that color space. However, the time complexity of this approach is still high, because time-consuming morphological analysis is usually done before face verification. There is an alternative: replacing the whole pattern recognition step by visual-cue detection, so that the detection process can be sped up considerably. However, the accuracy then drops dramatically, especially when there is a complex and distracting background. As described above, detecting faces in images reliably and quickly is a difficult problem. This paper proposes a multi-scale analysis scheme for face detection. Our idea is similar to that presented by Sahbi et al. [9]. However, they have not explained explicitly how the multi-scale analysis is done and only illustrate a simple and unreliable verification scheme based on histograms. In our work, explicit implementation details are explored and, in addition, speed-up methods and a reliable verification scheme are proposed. Under the proposed scheme, multi-scale analysis using a Quadtree and wavelet decomposition is performed on the input image. Skin color is detected starting from low-resolution images. Face verification, using a support vector machine (SVM) with wavelet coefficients as input, is performed once a skin color region is reported. Experimental results show that the scheme can detect faces efficiently and reliably. An overview of the proposed scheme is given in Section 2. The implementation details are explored in Sections 3, 4 and 5. Experimental results are shown in Section 6.
2 System Overview – QuadTree Based Searching
The proposed system consists of three major components, namely the skin color detection module, the wavelet transformation module and the support vector verification module. The skin color detection module extracts skin color regions from the image. The wavelet transformation module transforms the image into an image pyramid that consists of images at different resolutions and then forms a primitive Quadtree for efficient skin color searching. The support vector verification module determines whether the input pattern is a face pattern. The work flow of the system is shown in Algorithm 1 (and sketched in code below). In general, by starting the analysis at the lowest resolution and by limiting the range of resolutions to be analysed, the total number of pixels to be analysed is small and the algorithm is thus fast. In addition, no further morphological analysis is needed, because the coarse-to-fine analysis using the Quadtree structure implicitly performs image smoothing and noise filtering. Although the wavelet decomposition step seems to impose a heavier computational load than the simple blurring needed to generate an image pyramid, the computed wavelet coefficients can be reused in the verification step, which increases the reliability of the system. The system is therefore expected to detect faces reliably in reasonable time.
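Algorithm 1 is not reproduced here, but the prose above suggests roughly the following workflow; the helper callables and the depth limit are assumptions, and scipy.ndimage is used only to extract connected skin regions.

```python
# Hypothetical sketch of the coarse-to-fine quadtree search: probe skin color at
# low resolution first, descend a limited number of levels, verify with an SVM.
import numpy as np
from scipy import ndimage

def detect_faces(pyramid, skin_detector, svm_verifier, max_extra_levels=2):
    """pyramid[0] is the coarsest level; each entry is (approximation_image, wavelet_coeffs).
    skin_detector and svm_verifier are caller-supplied callables (assumptions)."""
    faces, start = [], None
    for level, (approx, coeffs) in enumerate(pyramid):
        skin_mask = skin_detector(approx)              # binary skin map at this resolution
        if not skin_mask.any():
            continue
        if start is None:
            start = level                              # resolution where skin first appears
        if level - start > max_extra_levels:
            break                                      # faces assumed to be of similar size
        labels, _ = ndimage.label(skin_mask)           # connected skin regions (quadtree cells)
        for region_slices in ndimage.find_objects(labels):
            if svm_verifier(coeffs, region_slices):
                faces.append((level, region_slices))
    return faces
```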
3 Color Analysis
According to recent research, skin color tends to form a cluster in different color spaces [10]. In the proposed system, color is used as the only heuristic feature. A skin color pixel can be identified by comparing the given pixel with the skin color distribution or model. The skin color model is learnt offline. Face images taken in an office environment under different lighting conditions are used for skin color learning, and skin pixels are separated manually. The extracted skin color pixels are converted from RGB color space to the normalized rg-color space through

$$r = \frac{R}{R+G+B}, \qquad g = \frac{G}{R+G+B}.$$

A histogram is then formed from the manually selected skin pixels in rg-color space. The histograms obtained in this step are illustrated in Fig. 1. As indicated in the figure, the distribution of skin color is quite compact and can thus be approximated by a Gaussian distribution $M = (\mu, \Sigma)$, where $M$ is the skin color model and $\mu$ and $\Sigma$ are the mean and covariance of the skin color distribution. The likelihood $P(x \mid M)$ can then be approximated through the Mahalanobis distance of a given rg-value from the model $M$,

$$d(x) = (x - \mu)^{T}\,\Sigma^{-1}\,(x - \mu),$$

where $x$ is the normalized rg-value of a given testing pixel. Detected regions can be reported by thresholding pixels with a high likelihood value (i.e., a small Mahalanobis distance).
Fig. 1. The histogram on the left shows the distribution of normalized rg-value of skin pixels while the histogram on the right shows those distribution of the non-skin pixels.
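A minimal sketch of this skin color model, assuming our own function boundaries and a roughly 3-sigma threshold on the Mahalanobis distance, is given below.

```python
# Illustrative sketch: rg-normalization, Gaussian skin model, Mahalanobis thresholding.
import numpy as np

def to_rg(image_rgb):
    rgb = image_rgb.astype(float)
    s = rgb.sum(axis=2, keepdims=True) + 1e-8
    return (rgb / s)[..., :2]                      # keep normalized r and g only

def fit_skin_model(skin_pixels_rg):
    mu = skin_pixels_rg.mean(axis=0)
    cov = np.cov(skin_pixels_rg, rowvar=False)
    return mu, np.linalg.inv(cov)

def skin_mask(image_rgb, mu, inv_cov, max_dist=9.0):
    rg = to_rg(image_rgb).reshape(-1, 2) - mu
    d = np.einsum('ij,jk,ik->i', rg, inv_cov, rg)  # squared Mahalanobis distance
    return (d < max_dist).reshape(image_rgb.shape[:2])
```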
4 Multi-scale Wavelet Analysis
An image can be broken down into constituent images at different resolutions through the wavelet transform, and an image pyramid can then be formed. Under the Quadtree-based searching scheme, color analysis can start from the tip of the pyramid (images with the lowest resolution) and proceed downwards to the bottom of the pyramid (images with the highest resolution). Information from the analysis at a lower resolution can be used in the analysis at a higher resolution. Analysis at a lower resolution can thus determine whether it is necessary to explore a certain set of pixels at a higher resolution. Time can hence be saved by avoiding unnecessary traversal of pixels at higher resolutions. Besides, by assuming that the detected faces are of similar size, we can limit the search depth to a finite number of levels after skin color is first detected and hence increase efficiency. The result of the discrete wavelet transform is shown in Fig. 2. The mathematical details of wavelet theory can be found in [11].
Fig. 2. Image pyramid formed from the discrete wavelet transform. At each transform step, four components are extracted: the low-pass approximation appears at the top left corner of the result image, and the three detail components appear at the top right, bottom right and bottom left corners. The transformation is applied recursively to the approximation component, according to wavelet theory.
According to wavelet theory, the image signal can be broken down into wavelets:

$$f(x) = \sum_{k} c_{j_0}(k)\,\varphi_{j_0,k}(x) + \sum_{j \ge j_0}\sum_{k} d_j(k)\,\psi_{j,k}(x)$$

where $\varphi$ is the scaling function and $\psi$ is the wavelet function. The corresponding scaling and wavelet coefficients are $c_{j_0}(k)$ and $d_j(k)$, respectively. The scaling coefficients form the images at different resolutions, while the wavelet coefficients form the feature vector in the face verification step.
The discrete wavelet transform can be computed by projecting the signal onto the scaling and wavelet functions:

$$c_j(k) = \sum_{x} f(x)\,\varphi_{j,k}(x), \qquad d_j(k) = \sum_{x} f(x)\,\psi_{j,k}(x)$$

In the system, a Daubechies wavelet is used because it has associated speed-up algorithms for wavelet decomposition. The efficiency of the wavelet decomposition is thus increased.
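Using PyWavelets as an assumed implementation vehicle, the image pyramid described above can be built as follows; the wavelet name ('db4') and the number of levels are illustrative choices, not the paper's stated settings.

```python
# Minimal sketch of building the Daubechies wavelet pyramid with PyWavelets.
import numpy as np
import pywt

def build_pyramid(gray, levels=4, wavelet='db4'):
    """Return a list of (approximation, detail_coeffs) from coarsest to finest."""
    pyramid = []
    approx = gray.astype(float)
    for _ in range(levels):
        approx, details = pywt.dwt2(approx, wavelet)   # details = (cH, cV, cD)
        pyramid.append((approx, details))
    return pyramid[::-1]                               # coarsest level first
```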
5 Verification by Support Vector Machine
Support vector machines have been widely used in face detection and recognition recently due to their non-linear classification power [12], [13]. Given a data set $\{(x_i, y_i)\}_{i=1}^{n}$, a support vector machine can learn the association between $x$ and $y$. In the proposed system, $x$ is the normalized wavelet coefficient set obtained from the discrete wavelet transform of the image in the Quadtree at the level where skin color is detected, and $y \in \{+1, -1\}$ refers to the face and non-face classes. Wavelet coefficients are used because they are insensitive to illumination and thus more robust for detection than the scaling coefficients. During the pre-learning phase, the discrete wavelet transform is performed on the face images (inside the face database) and the corresponding wavelet coefficients of each face image are extracted as a feature vector. The ORL Database of Faces (http://www.uk.research.att.com/facedatabase.html) was used to train the support vector machine. The feature extraction result is shown in Fig. 3. During the learning phase, the support vector machine is trained to learn the face pattern. During the testing phase, the wavelet coefficients corresponding to a skin region reported by the skin color module are converted to a feature vector, which is classified by the support vector machine; the face pattern can then be verified. Details of support vector machines can be found in [14].

In order to use a support vector machine, a kernel function must be defined. In the proposed system, the Gaussian RBF kernel $K(x, x') = \exp(-\|x - x'\|^2/2\sigma^2)$ is used. The decision function can be written as:

$$f(x) = \mathrm{sign}\left(\sum_{i}\alpha_i\,y_i\,K(x_i, x) + b\right) \qquad (5)$$

During the learning phase, the coefficients $\alpha_i$ are learnt from the data set under the following criteria function:

$$\max_{\alpha}\;\sum_{i}\alpha_i - \frac{1}{2}\sum_{i}\sum_{j}\alpha_i\,\alpha_j\,y_i\,y_j\,K(x_i, x_j), \qquad \text{subject to } \alpha_i \ge 0.$$
Fig. 3. The wavelet coefficients are extracted from the face images (extracted from ORL Database of Faces). These coefficients will be used to train the support vector machine.
In the system, the coefficients $\alpha_i$ are learnt through gradient ascent:

$$\alpha_i \leftarrow \alpha_i + \eta\left(1 - y_i\sum_{j}\alpha_j\,y_j\,K(x_i, x_j)\right)$$

where $\eta$ is the learning rate. During the face verification or testing phase, equation (5) can be used to determine whether the input is a face pattern.
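As an illustration only, the sketch below trains an RBF-kernel SVM with scikit-learn instead of the paper's own gradient-ascent procedure; the feature vectors (wavelet coefficients of face and non-face patches) and the kernel parameters are assumed inputs.

```python
# Illustrative sketch: RBF-kernel SVM face/non-face verifier.
import numpy as np
from sklearn.svm import SVC

def train_face_verifier(face_features, nonface_features, gamma=0.05, C=10.0):
    X = np.vstack([face_features, nonface_features])
    y = np.concatenate([np.ones(len(face_features)), -np.ones(len(nonface_features))])
    clf = SVC(kernel='rbf', gamma=gamma, C=C)
    return clf.fit(X, y)

def is_face(clf, wavelet_feature_vector):
    """+1 -> face, -1 -> non-face (cf. the decision function in equation (5))."""
    return clf.predict(wavelet_feature_vector.reshape(1, -1))[0] > 0
```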
6 Experiments and Results
The proposed face detection algorithm was implemented in Visual C++ under Microsoft Windows. The experiments were done on a P4 2.26 GHz computer with 512 MB of RAM running Microsoft Windows. The system was tested by detecting faces in images. The qualitative results are shown in Figs. 4 and 5, which show that faces can be detected even under illumination variation and in the presence of distractors (skin color of limbs). The quantitative results are shown in Table 1. A comparison with a face detection algorithm using color analysis alone was also made. The qualitative results of the color-based algorithm are shown in Fig. 6. The figure shows that this algorithm does not work reliably and is easily affected by distractors. In addition, its run time is around 1.5 seconds, which is not significantly faster than the proposed algorithm. If the face verification module is added without using multi-scale analysis, the accuracy does improve, but the run time rises to over 60 s on average. This is mainly due to the time-consuming morphological operations and brute-force verification within the skin color regions.
7 Conclusion
Face detection is useful in various industrial applications such as human-computer interfaces. However, commonly used face detection algorithms are time consuming
Fig. 4. Each column in this figure shows the face detection result for the input image in the first row. The input images show variation in illumination. The second row shows the resultant binary image of the skin detection. The third row shows the skin regions detected (in white) and the face detected (in a green box).
Fig. 5. This figure shows the face detection result on an image with multiple faces. The left image is the input image. The middle one shows the resultant binary image of the skin detection. The right image shows the skin regions detected (in white) and the faces detected (in green boxes).
Fig. 6. This figure shows the face detection result on images using a common color-based detection algorithm. The left column shows the result for single face detection, while the second column shows the result for multiple face detection. The input images are the same as those used in the previous experiments. The first row shows the skin regions detected (in white) and the second row shows the resultant binary image of the skin detection.
even if visual cues are used. This paper proposed a multi-scale analysis scheme using a Quadtree that searches for a visual cue (skin color) from coarse to fine scales. Searching time is reduced because possible regions are first explored at a lower resolution and the search is limited to an appropriate range of resolutions. In addition, face verification ensures a high detection accuracy. Experimental results show that the proposed algorithm can detect faces efficiently and reliably. Note that the proposed algorithm assumes that faces in the image are of similar size and in frontal view. Detection of faces from different views, depths and sizes will be investigated in the future.
References
1. Hjelmas, E., Low, B.K.: Face detection: A survey. CVIU 83 (2001) 236-274
2. Yang, M., Kriegman, D., Ahuja, N.: Detecting faces in images: A survey. PAMI 24 (2002) 34-58
3. Rowley, H.A., Baluja, S., Kanade, T.: Neural network-based face detection. PAMI 20 (1998) 23-38
4. Wu, H., Yoshida, Y., Shioyama, T.: Optimal gabor filters for high speed face identification. In: ICPR02. (2002) I: 107-110
5. Dass, S., Jain, A., Lu, X.: Face detection and synthesis using markov random field models. In: ICPR02. (2002) IV: 201-204
6. Wang, H., Chang, S.F.: A highly efficient system for automatic face region detection in mpeg video. IEEE Transactions on Circuits and Systems for Video Technology 7 (1997) 615-628
7. Hsu, R.L., Abdel-Mottaleb, M., Jain, A.K.: Face detection in color images. In: ICIP01. (2001) I: 1046-1049
8. Phung, S.L., Bouzerdoum, A., Chai, D.: A novel skin color model in ycbcr color space and its application to human face detection. In: ICIP02. (2002) I: 289-292
9. Sahbi, H., Boujemaa, N.: Coarse-to-fine face detection based on skin color adaption. In: ECCV's 2002 Workshop on Biometric Authentication. (2002) 112-120
10. Swain, M., Ballard, D.: Color indexing. IJCV 7 (1991) 11-32
11. Daubechies, I.: Ten Lectures on Wavelets. SIAM, Philadelphia (1992)
12. Ai, H., Liang, L., Xu, G.: Face detection based on template matching and support vector machines. In: ICIP01. (2001) I: 1006-1009
13. Fransens, R., DePrins, J., Gool, L.J.V.: SVM-based nonparametric discriminant analysis, an application to face detection. In: ICCV03. (2003) 1289-1296
14. Scholkopf, B., Burges, C.J.C., Smola, A.J.: Advances in Kernel Methods - Support Vector Learning. MIT Press, Cambridge, MA (1999)
Three-Dimensional Face Recognition: A Fishersurface Approach
Thomas Heseltine, Nick Pears, and Jim Austin
Department of Computer Science, The University of York, United Kingdom
Abstract. Previous work has shown that principal component analysis (PCA) of three-dimensional face models can be used to perform recognition to a high degree of accuracy. However, experimentation with two-dimensional face images has shown that PCA-based systems are improved by incorporating linear discriminant analysis (LDA), as with Belhumeur et al.'s fisherface approach. In this paper we introduce the fishersurface method of face recognition: an adaptation of the two-dimensional fisherface approach to three-dimensional facial surface data. Testing a variety of pre-processing techniques, we identify the most effective facial surface representation and distance metric for use in such application areas as security, surveillance and data compression. Results are presented in the form of false acceptance and false rejection rates, taking the equal error rate as a single comparative value.
1 Introduction
Despite significant advances in face recognition technology, it has yet to achieve the levels of accuracy required for many commercial and industrial applications. The high error rates stem from a number of well-known sub-problems. Variations in lighting conditions, facial expression and orientation can all significantly increase error rates, making it necessary to maintain consistent image capture conditions between query and gallery images. However, this approach eliminates a key advantage offered by face recognition: a passive biometric that does not require subject co-operation. In an attempt to address these issues, research has begun to focus on the use of three-dimensional face models, motivated by three main factors. Firstly, by relying on geometric shape rather than colour and texture information, systems become invariant to lighting conditions. Secondly, the ability to rotate a facial structure in three-dimensional space, allowing for compensation of variations in pose, aids those methods requiring alignment prior to recognition. Thirdly, the additional discriminatory depth information in the facial surface structure, not available from two-dimensional images, provides supplementary cues for recognition. In this paper we investigate the use of facial surface data, taken from 3D face models (generated using a stereo vision 3D camera), as a substitute for the more familiar two-dimensional images. A number of investigations have shown that geometric facial structure can be used to aid recognition. Zhao and Chellappa [1] use a generic 3D face model to normalise facial orientation and lighting direction prior to
recognition, increasing accuracy from approximately 81% (correct match within a rank of 25) to 100%. Similar results are witnessed in the Face Recognition Vendor Test [2], showing that pose correction using Romdhani et al.'s 3D morphable model technique [3] reduces error rates when applied to the FERET database. Blanz et al. [4] take a comparable approach, using a 3D morphable face model to aid the identification of 2D face images. Beginning with an initial estimate of lighting direction and face shape, Blanz et al. iteratively alter the shape and texture parameters of the morphable face model, minimising the difference to the two-dimensional image. These parameters are then taken as features for identification, resulting in 82.6% correct identifications on a test set of 68 people. Although these methods show that knowledge of three-dimensional face shape can aid normalisation for two-dimensional face recognition systems, none of the methods mentioned so far use actual geometric structure to perform recognition. In contrast, Beumier and Acheroy [5, 6] make direct use of such information, testing various methods of matching 3D face models, although few were successful. Curvature analysis proved ineffective, and feature extraction was not robust enough to provide accurate recognition. However, Beumier and Acheroy were able to achieve reasonable error rates using the curvature of vertical surface profiles. Verification tests carried out on a database of 30 people produced EERs between 7.25% and 9.0% on the automatically aligned surfaces and between 6.25% and 9.5% when manual alignment was used. Chua et al. [4] attempt to identify and extract rigid areas of 3D facial surfaces, creating a system invariant to facial expression. The similarity of two face models is computed by comparing a set of unique point signatures for each face. Identification tests show that the probe image is identified correctly for all people when applied to a test set of 30 depth maps of 6 different people. Hesher et al. [7] use PCA of depth maps and a Euclidean distance metric to perform identification with 94% accuracy on 37 face models (when training is performed on the gallery set). Further investigation into this approach is carried out by Heseltine et al. [8], showing how different surface representations and distance measures affect recognition, reducing the EER from 19.1% to 12.7% when applied to a difficult test set of 290 face models. Having achieved reasonable success with the PCA-based eigensurface system in previous work [8], we now continue this line of research, experimenting with another well-known method of face recognition, namely the fisherface approach as described by Belhumeur et al. [9], adapted for use on three-dimensional face data. Testing a range of surface representations and distance metrics, we identify the most effective methods of recognising faces using three-dimensional surface structure.
2 The 3D Face Database
Until recently, little three-dimensional face data has been publicly available for research, and nothing approaching the magnitude required for the development and testing of three-dimensional face recognition systems. In these investigations we use a new database of 3D face models, recently made available by The University of York as part of an ongoing project to provide a publicly available 3D Face Database [10].
Face models are generated in sub-second processing time from a single shot with a 3D camera, using a stereo vision technique enhanced by light projection. For the purpose of these experiments we select a sample of 1770 face models (280 people) captured under the conditions in Fig. 1. During data acquisition no effort was made to control lighting conditions. In order to generate face models at various head orientations, subjects were asked to face reference points positioned roughly 45° above and below the camera, but no effort was made to enforce precise orientation.
Fig. 1. Example face models taken from The University of York 3D face database
3D models are aligned to face directly forwards before conversion into a depth map representation. The database is then separated into two disjoint sets: the training set, consisting of 300 depth maps (6 depth maps of each of 50 people), and a test set of the remaining 1470 depth maps (230 people), covering all capture conditions shown in Fig. 1. Both the training and test sets contain subjects of various races, ages and genders, and no subject is present in both sets.
3 Surface Representations
It is well known that image processing can significantly reduce the error rates of two-dimensional face recognition methods [11, 12, 13] by removing the effects of environmental capture conditions. Much of this environmental influence is not present in 3D face models; however, Heseltine et al [8] have shown that such pre-processing may still aid recognition by making distinguishing features more explicit and reducing noise content. In this section we describe a variety of surface representations, derived from aligned 3D face models, which may affect recognition error rates. Pre-processing techniques are applied prior to both training and test procedures, such that a separate surface space, and hence a separate face recognition system, is generated for each surface representation.
4 The Fishersurface Method
In this section we provide details of the fishersurface method of face recognition. We apply PCA and LDA to surface representations of 3D face models, producing a subspace projection matrix, as with Belhumeur et al's fisherface approach [9], taking advantage of 'within-class' information, minimising variation between multiple face models of the same person, yet maximising class separation. To accomplish this we use a training set containing several examples of each subject, describing the variance in facial structure (due to influences such as facial expression) from one model to another. From the training set we compute three scatter matrices, representing the within-class, between-class and total distribution about the class averages and the overall average surface, as shown in equation 1.
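In the standard fisherface formulation on which this method is based [9], these three matrices take the following form (a reconstruction using assumed notation, with $\Gamma$ a surface vector, $X_c$ the set of surfaces of class $c$, $\Psi_c$ the class average and $\Psi$ the overall average surface):

$$
S_B = \sum_{c=1}^{C} |X_c|\,(\Psi_c - \Psi)(\Psi_c - \Psi)^{T}, \qquad
S_W = \sum_{c=1}^{C} \sum_{\Gamma \in X_c} (\Gamma - \Psi_c)(\Gamma - \Psi_c)^{T}, \qquad
S_T = S_W + S_B .
$$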
The training set is partitioned into c classes, such that all surface vectors in a single class belong to the same person and no person is present in multiple classes. Calculating the eigenvectors of the total scatter matrix and taking the top 250 (the number of surfaces minus the number of classes) principal components, we produce a PCA projection matrix. This is then used to reduce the dimensionality of the within-class and between-class scatter matrices (ensuring they are non-singular) before computing the top c-1 eigenvectors of the reduced scatter matrix ratio, as shown in equation 2.
Finally, the overall projection matrix is calculated, such that it projects a face surface vector into a reduced space of c-1 dimensions, in which between-class scatter is maximised over all c classes while within-class scatter is minimised within each class. Like the eigenface system, the components of the projection matrix can be viewed as images, as shown in Fig. 2 for the depth map surface space.
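As an illustration, a minimal sketch of this two-stage projection (PCA followed by LDA on the reduced scatter matrices) might look as follows; the NumPy-based implementation and variable names are our own assumptions, not part of the original system:

```python
import numpy as np

def train_fishersurface(X, labels, n_pca):
    """Fisherface-style projection trained on surface vectors.

    X      : (N, d) array, one flattened depth map (surface vector) per row.
    labels : (N,) array of subject identifiers.
    n_pca  : number of principal components kept (e.g. N - c).
    Returns the combined (d, c-1) projection matrix and the mean surface.
    """
    classes = np.unique(labels)
    c = len(classes)
    mean = X.mean(axis=0)

    # PCA stage: top n_pca eigenvectors of the total scatter.
    Xc = X - mean
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    W_pca = Vt[:n_pca].T                      # (d, n_pca)

    # Scatter matrices in the PCA-reduced space.
    Y = Xc @ W_pca
    Sw = np.zeros((n_pca, n_pca))
    Sb = np.zeros((n_pca, n_pca))
    grand = Y.mean(axis=0)
    for cl in classes:
        Yc = Y[labels == cl]
        mu_c = Yc.mean(axis=0)
        Sw += (Yc - mu_c).T @ (Yc - mu_c)
        Sb += len(Yc) * np.outer(mu_c - grand, mu_c - grand)

    # LDA stage: top c-1 eigenvectors of inv(Sw) @ Sb.
    eigval, eigvec = np.linalg.eig(np.linalg.inv(Sw) @ Sb)
    order = np.argsort(-eigval.real)[: c - 1]
    W_lda = eigvec[:, order].real             # (n_pca, c-1)

    return W_pca @ W_lda, mean                # a face-key is (x - mean) @ W
```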
Fig. 2. The average surface (left) and first five fishersurfaces (right)
Once surface space has been defined, we project a facial surface into reduced surface space by a simple matrix multiplication, as shown in equation 3.
The resulting vector is taken as a 'face-key' representing the facial structure in the reduced-dimensionality space. Face-keys are compared using either Euclidean or cosine distance measures, as shown in equation 4.
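In their usual forms (assumed here, consistent with the description above), the two measures for face-keys $a$ and $b$ are

$$
d_{\mathrm{euclidean}}(a,b) = \lVert a - b \rVert, \qquad
d_{\mathrm{cosine}}(a,b) = 1 - \frac{a \cdot b}{\lVert a \rVert\, \lVert b \rVert}.
$$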
An acceptance (facial surfaces match) or rejection (surfaces do not match) is determined by applying a threshold to the distance calculated. Any comparison producing a distance value below the threshold is considered an acceptance.
5 The Test Procedure
In order to evaluate the effectiveness of a surface space, we project and compare each of the 1470 face surfaces with every other surface in the test set; no surface is compared with itself and each pair is compared only once (1,079,715 verification operations). The false acceptance rate (FAR) and false rejection rate (FRR) are then calculated as the percentage of incorrect acceptances and incorrect rejections after applying a threshold. Varying the threshold produces a series of FAR/FRR pairs, which, plotted on a graph, produce an error curve as seen in Fig. 5. The equal error rate (EER, the point at which FAR equals FRR) can then be taken as a single comparative value.
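A minimal sketch of this evaluation loop, assuming arrays of pairwise distances and same/different-subject flags (the names and the NumPy-based implementation are our own):

```python
import numpy as np

def far_frr_eer(distances, same_subject):
    """Sweep a threshold over pairwise distances and estimate the EER.

    distances    : (P,) array of distances between compared face-keys.
    same_subject : (P,) boolean array, True where a pair shows the same person.
    """
    thresholds = np.sort(np.unique(distances))
    far, frr = [], []
    for th in thresholds:
        accepted = distances < th
        far.append(np.mean(accepted[~same_subject]))   # impostors accepted
        frr.append(np.mean(~accepted[same_subject]))   # genuine pairs rejected
    far, frr = np.array(far), np.array(frr)
    i = np.argmin(np.abs(far - frr))                    # point where FAR ~ FRR
    return far, frr, (far[i] + frr[i]) / 2.0            # EER estimate
```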
Fig. 3. Flow chart of system evaluation procedure
6 Results
In this section we present results gathered from performing 1,079,715 verification operations on the test set of 1470 face models, using the surface representations described in section 3. Systems are tested separately using Euclidean and cosine distance measures. In addition we provide a direct comparison to the eigensurface method [8] trained and tested using the same face models, distance metrics and the same number of (c-1) principal components.
Fig. 4. EERs of fishersurface and eigensurface systems using two distance metrics
Fig. 4 shows the spread of error for the eigensurface and fishersurface methods, using cosine and Euclidean metrics over the range of surface representations. The initial depth map produces an EER of 23.3% (Euclidean distance) and 15.3% (cosine distance). This trend is common to all fishersurface systems, with the cosine distance typically producing three quarters of the error produced by the Euclidean distance. In all cases the EERs of the fishersurface system are lower than those of the equivalent eigensurface method. Surface gradient representations are the most discriminating, with horizontal derivatives providing the lowest error of 11.3% EER.
Fig. 5. Fishersurface system error curves using two distance metrics and surface representations
7 Conclusion
We have applied a well-known method of two-dimensional face recognition to three-dimensional face models using a variety of facial surface representations. The error rates produced using the initial depth map representation (15.3% and 23.3% EER) show a distinct advantage over the previously developed eigensurface method (32.2% and 24.5% EER). This is also the case for the optimum surface representations, producing 11.3% EER for the fishersurface system and 24.5% EER for the eigensurface method. We also note an increase in the eigensurface EERs compared to those reported in previous work [8]. This could be attributed to the different training and test data, or possibly to the different number of principal components used. Experimenting with a number of surface representations, we have discovered common characteristics between the eigensurface and fishersurface methods: facial surface gradients provide a more effective representation for recognition, with horizontal gradients producing the lowest error rate (11.3% EER). Another observation, also common to the eigensurface method, is that curvature representations seem to be the least useful for recognition, although this could be a product of inadequate 3D model resolution and high noise content, in which case smoothing filters and larger convolution kernels may produce better results. The fishersurface method appears to produce better results than corresponding two-dimensional fisherface systems (17.8% EER) tested under similar conditions in previous
investigations [13], although a more direct comparison, using a common test database, is required in order to draw any quantitative conclusions. Testing two distance measures has shown that the choice of metric has a considerable effect on the resultant error rates. For all surface representations, the cosine distance produced substantially lower EERs. This is in stark contrast to the eigensurface method, in which the Euclidean and cosine measures seem tailored to specific surface representations. This suggests that incorporating LDA produces a surface space with predominantly radial between-class variance, regardless of the surface representation, whereas when using PCA alone this relationship is dependent on the type of surface representation used. In summary, we have managed to reduce error rates from 15.3% EER using initial depth maps to an EER of 11.3% using a horizontal gradient representation. This improvement over the best eigensurface system shows that the incorporation of LDA improves performance in three-dimensional as well as two-dimensional face recognition approaches. Given that the 3D capture method produces face models invariant to lighting conditions and provides the ability to recognise faces regardless of pose, this system is particularly suited for use in security and surveillance applications.
References
1. Zhao, W., Chellappa, R.: 3D Model Enhanced Face Recognition. In: Proc. of the Int. Conf. on Image Processing, Vancouver (2000)
2. Phillips, P.J., Grother, P., Micheals, R.J., Blackburn, D.M., Tabassi, E., Bone, J.M.: FRVT 2002: Overview and Summary. http://www.frvt.org/FRVT2002/documents.htm (2003)
3. Romdhani, S., Blanz, V., Vetter, T.: Face Identification by Fitting a 3D Morphable Model using Linear Shape and Texture Error Functions. In: Proc. ECCV (2002)
4. Blanz, V., Romdhani, S., Vetter, T.: Face Identification across Different Poses and Illuminations with a 3D Morphable Model. In: Proc. of the 5th IEEE Conf. on AFGR (2002)
5. Beumier, C., Acheroy, M.: Automatic 3D Face Authentication. Image and Vision Computing, Vol. 18, No. 4 (2000) 315-321
6. Beumier, C., Acheroy, M.: Automatic Face Verification from 3D and Grey Level Clues. In: Proc. of the 11th Portuguese Conference on Pattern Recognition (2000)
7. Hesher, C., Srivastava, A., Erlebacher, G.: Principal Component Analysis of Range Images for Facial Recognition. In: Proc. CISST (2002)
8. Heseltine, T., Pears, N., Austin, J.: Three-Dimensional Face Recognition: An Eigensurface Approach. In: Proc. of the International Conference on Image Processing (2004)
9. Belhumeur, P., Hespanha, J., Kriegman, D.: Eigenfaces vs. Fisherfaces: Face Recognition Using Class Specific Linear Projection. In: Proc. of the European Conference on Computer Vision (1996)
10. The 3D Face Database, The University of York. www.cs.york.ac.uk/~tomh
11. Adini, Y., Moses, Y., Ullman, S.: Face Recognition: the Problem of Compensating for Changes in Illumination Direction. IEEE Trans. on Pattern Analysis and Machine Intelligence (1997) 721-732
12. Heseltine, T., Pears, N., Austin, J.: Evaluation of Image Pre-processing Techniques for Eigenface-Based Face Recognition. In: Proc. of the 2nd International Conference on Image and Graphics, SPIE Vol. 4875 (2002) 677-685
13. Heseltine, T., Pears, N., Austin, J., Chen, Z.: Face Recognition: A Comparison of Appearance-Based Approaches. In: Proc. VIIth Digital Image Computing: Techniques and Applications (2003)
Face Recognition Using Improved-LDA Dake Zhou and Xin Yang Institute of Image Processing & Pattern Recognition, Shanghai Jiaotong University, Shanghai 200030, China {normanzhou, yangxin}@sjtu.edu.cn
Abstract. This paper introduces an improved-LDA (I-LDA) approach to face recognition, which can effectively deal with two problems encountered in LDA-based face recognition approaches: 1) the degraded generalization ability caused by the "small sample size" problem, and 2) the fact that the Fisher criterion is non-optimal with respect to classification rate. In particular, the I-LDA approach can also improve the classification rate of one or several appointed classes by using a suitable weighted scheme. The key to this approach is to use direct-LDA techniques for dimension reduction while utilizing a modified Fisher criterion that is more closely related to classification error. Comparative experiments on the ORL face database verify the effectiveness of the proposed method.
1 Introduction
Face recognition (FR) techniques can be roughly categorized into two main classes: feature-based approaches and holistic-based approaches [1]. Among various FR techniques, the most promising seem to be the holistic-based approaches, since they avoid the difficulties of facial shape or feature detection encountered in the feature-based approaches. For holistic-based approaches, feature extraction techniques are crucial to performance. Linear discriminant analysis (LDA) and principal component analysis (PCA) are the two most widely used tools for feature extraction in holistic-based approaches; e.g., the famous Fisherfaces [2] and Eigenfaces [3] are based on these two techniques, respectively. LDA, based on the Fisher criterion of seeking the projection which maximizes the ratio of the between- and within-class scatters, is a well-known classical statistical technique for dimensionality reduction and feature extraction [4]. It is therefore generally believed that, for the FR problem, LDA-based algorithms outperform PCA-based ones, since the former exploit the class information to build the most discriminatory feature space for classification while the latter achieves simply object reconstruction in the sense of mean-square error. Belhumeur et al. first suggested an LDA-based approach to face recognition, which is also referred to as Fisherfaces [2]. This work was partially supported by the National Natural Science Foundation of China (No. 30170264) and the National Grand Fundamental Research 973 Program of China (No. 2003CB716104).
Inspired by the success of Fisherfaces, there are at present many LDA extensions that try to find a more effective feature subspace for FR, such as direct-LDA (D-LDA) [5,6] and the Enhanced Fisher linear discriminant Model (EFM) [7]. Although LDA has been successfully used for FR tasks in many cases, there are still two problems in LDA-based FR approaches [2,5,8,9]. One is the degraded generalization ability caused by the so-called "small sample size" (SSS) problem, which widely exists in FR tasks because the number of training samples (typically less than 10 per person) is much smaller than the dimensionality of the samples. One solution to the SSS problem is the "regularization technique", which adds a small perturbation to the within-class scatter matrix [8]. Another option is to discard the null space of the within-class scatter matrix as a preprocessing step for dimension reduction [2]. However, the discarded subspace may contain significant discriminatory information. Recently, direct-LDA (D-LDA) methods for face recognition have been presented, in which the null space of the between-class scatter matrix, or the complement space of the null space of the within-class scatter matrix, containing no significant discriminatory information, is discarded [5,6]. Another problem encountered in LDA-based approaches is that the traditional Fisher separability criterion is non-optimal with respect to classification rate in the multiclass case. Loog et al. proposed a weighted LDA (W-LDA) method using an "approximation weighted pairwise Fisher criteria" to relieve this problem [9]. But this method cannot be directly applied to high-dimensional patterns, such as face images, because of its computational complexity and the existence of the SSS problem. This paper introduces an improved-LDA (I-LDA) approach for face recognition, which relieves the above two problems to a great extent. In particular, the I-LDA approach can also improve the classification rate of one or several appointed classes by using an appropriate weighted scheme. The proposed approach first lowers the dimensionality of the original input space by discarding the null space of the between-class scatter matrix, which contains no significant discriminatory information. After introducing weighted schemes into the reconstruction of the between- and within-class scatter matrices in the dimension-reduced subspace, a modified Fisher criterion is obtained by replacing the within-class scatter matrix in the traditional Fisher separability criterion with the total-class scatter matrix. LDA using the modified criterion is then implemented to find lower-dimensional features with significant discrimination power. Finally, the nearest neighbor (to the mean) rule and the Euclidean distance measure are used for classification. Experimental results on the ORL face database show that the proposed approach is an effective method for face recognition.
2 Review of LDA
The problem of feature extraction in FR can be stated as follows. Given a set of N training face images, each represented as an n-dimensional column vector, let $\omega_1, \dots, \omega_K$ denote the K classes. The objective is to find a transformation T, based on the optimization of a certain separability criterion, that produces for each image x a low-dimensional feature vector $y = T(x)$ with significant discriminatory power.
LDA is one of the most widely used linear feature extraction techniques in the FR community, and is also referred to as Fisher Linear Discriminant analysis (FLD). Let $S_w$ and $S_b$ denote the within- and between-class scatter matrices in the input space, respectively. The goal of LDA is to find a set of basis vectors, denoted as W, that maximizes the Fisher criterion function J(W) defined as:

$$J(W) = \frac{|W^{t} S_b W|}{|W^{t} S_w W|} \qquad (1)$$
Suppose the matrix $S_w$ is nonsingular; then the criterion function J(W) is maximized when W consists of the eigenvectors of the matrix $S_w^{-1} S_b$. Unfortunately, $S_w$ is often singular in FR tasks because of the existence of the SSS problem. As a result, LDA overfits the training data and thus generalizes poorly to new testing data. Additionally, the traditional Fisher criterion defined by Eq. (1) is not directly related to classification rate in the multiclass case.
3 Improved-LDA (I-LDA)
The proposed I-LDA approach, which uses D-LDA techniques for dimensionality reduction while at the same time utilizing weighted schemes to obtain a modified Fisher criterion that is more closely related to classification error, can effectively deal with the above two problems encountered in traditional LDA-based approaches. In particular, I-LDA can also improve the classification rate of one or several appointed classes by using a suitable weighted scheme. Fig. 1 gives a conceptual overview of the algorithm.
Fig. 1. Flow chart of the I-LDA algorithm.
3.1 Dimensionality Reduction
Since the null space of the between-class scatter matrix $S_b$ contains no significant discriminatory information [5,6], one can safely discard it without losing useful information. To remove the null space of $S_b$, we first diagonalize it as $V^{t} S_b V = \Lambda$, where t denotes the transpose operator, V is the orthonormal eigenvector matrix of $S_b$ and $\Lambda$ is the diagonal eigenvalue matrix of $S_b$ with diagonal elements in decreasing order. We can then obtain the matrices $V_m$ (the first m columns of V, corresponding to the nonzero eigenvalues) and $\Lambda_m$ (the corresponding m × m block of $\Lambda$) such that $V_m^{t} S_b V_m = \Lambda_m > 0$. Now, project the training samples from the original input space into the dimensionality-reduced subspace spanned by the columns of $V_m$.

It should be noted that the direct eigen-decomposition of $S_b$ is very difficult or impossible, since its dimensionality is very high. Fortunately, $S_b$ can be rewritten as

$$S_b = \sum_{i=1}^{K} P(\omega_i)\,(m_i - M)(m_i - M)^{t} = \Phi\,\Phi^{t},$$

where $m_i$ and M are the means of the classes and the grand mean of the training samples, $P(\omega_i)$ is the prior probability of the i-th class, and $\Phi$ is the matrix whose i-th column is $\sqrt{P(\omega_i)}\,(m_i - M)$. According to the singular-value-decomposition (SVD) principle, the first m eigenvectors of $S_b$, which correspond to nonzero eigenvalues, can be indirectly computed by an eigenanalysis of the matrix $\Phi^{t}\Phi$. As $\Phi^{t}\Phi$ is a K × K matrix, its eigenanalysis is affordable.
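A minimal sketch of this indirect eigenanalysis (NumPy-based; the names `class_means`, `priors` and `grand_mean` are our own assumptions):

```python
import numpy as np

def between_class_eigenvectors(class_means, priors, grand_mean):
    """Leading eigenvectors of Sb = Phi @ Phi.T computed via the small K x K matrix.

    class_means : (K, n) array of class mean vectors.
    priors      : (K,) array of class prior probabilities.
    grand_mean  : (n,) overall mean vector.
    Returns an (n, m) matrix whose columns are eigenvectors of Sb with nonzero eigenvalues.
    """
    # Phi has one column per class: sqrt(P_i) * (m_i - M), shape (n, K).
    Phi = (np.sqrt(priors)[:, None] * (class_means - grand_mean)).T

    small = Phi.T @ Phi                        # K x K, cheap to decompose
    eigval, eigvec = np.linalg.eigh(small)
    keep = eigval > 1e-10                      # discard the null space of Sb
    eigval, eigvec = eigval[keep], eigvec[:, keep]

    # Map small-matrix eigenvectors back to the input space and normalise.
    return Phi @ eigvec / np.sqrt(eigval)      # columns diagonalize Sb
```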
3.2 Weighted Schemes and Modified Criterion
Loog et al. have shown that the traditional Fisher criterion defined by Eq. (1) is not directly related to classification error in the multiclass case [9]. They also demonstrated that the classes with larger distances to each other in the output space are emphasized more when the Fisher criterion is optimized, with the result that the resulting projection preserves the distances of already well-separated classes, causing a large overlap of neighboring classes. To obtain a modified criterion that is more closely related to classification error, weighted schemes should be introduced into the traditional Fisher criterion to penalize the classes that are close to each other and thus lead to potential misclassifications in the output space. However, we would like to keep the general form of Eq. (1), because then the optimization can be carried out by solving a generalized eigenvalue problem without having to resort to complex iterative optimization schemes. Therefore, in this paper, simple weighted schemes are introduced into the reconstruction of the between-class scatter matrix in the dimensionality-reduced subspace, which differs from the scheme used in [9]. The weighted between-class scatter matrix is redefined as follows:
where $m_i$ is the mean of the i-th class and $d_{ij}$ is the Mahalanobis distance between the i-th class and the j-th class in the dimensionality-reduced subspace. The weighting function $w(d_{ij})$ is a monotonically decreasing function of the distance, with the constraint that it should drop faster than the square of $d_{ij}$.
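In the pairwise form introduced by Loog et al. [9], which this construction follows, the weighted between-class scatter can be written as (a sketch under that assumption, not a verbatim reproduction):

$$
\hat{S}_b = \sum_{i=1}^{K-1} \sum_{j=i+1}^{K} P(\omega_i)\, P(\omega_j)\, w(d_{ij})\, (m_i - m_j)(m_i - m_j)^{t} .
$$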
Additionally, correct coefficients are introduced into the weighted within-class scatter matrix, defined as:

$$\hat{S}_w = \sum_{i=1}^{K} k_i\, P(\omega_i)\; E\big[(x - m_i)(x - m_i)^{t} \mid \omega_i\big],$$

where E(·) denotes the expectation operator and the correct coefficients $k_i$ are designed to describe the "important degree" of the i-th class. In the general case, $k_i = 1$ for every class. But in the special case in which we have a particular interest in the i-th class and want to improve its classification rate, we can achieve this by increasing its corresponding correct coefficient, forcing the resulting projection to favor that class. Note that the improvement of the classification rate of one or several special classes will in turn increase the overall classification error, and we demonstrate this in our experiments. As the within-class scatter matrix may be singular in the dimensionality-reduced subspace, we further replace the within-class scatter matrix in the traditional Fisher criterion with the total-class scatter matrix. Finally, the Fisher criterion is modified as:

$$J(W) = \frac{|W^{t} \hat{S}_b W|}{|W^{t} \hat{S}_t W|} \qquad (7)$$
where the total-class scatter matrix is $\hat{S}_t = \hat{S}_b + \hat{S}_w$. It is easy to prove that the projection W that maximizes the modified criterion defined by Eq. (7) also maximizes $|W^{t}\hat{S}_b W| / |W^{t}\hat{S}_w W|$, precisely because $\hat{S}_t = \hat{S}_b + \hat{S}_w$ [8].
3.3 Overall Optimal Transformation Matrix
The criterion defined by Eq. (7) is maximized when the projection W consists of the eigenvectors of the matrix $\hat{S}_t^{-1}\hat{S}_b$, i.e.

$$\hat{S}_t^{-1}\hat{S}_b\, W = W \Delta,$$

where $\Delta$ is the corresponding diagonal eigenvalue matrix of $\hat{S}_t^{-1}\hat{S}_b$, with diagonal elements in decreasing order. To further reduce the dimensionality to l, W should consist only of the first l eigenvectors, which correspond to the l largest eigenvalues. Therefore, the overall optimal transformation matrix T is obtained by composing the dimensionality-reduction projection of Section 3.1 with W.
4 Experimental Results
We use the publicly available ORL face database to evaluate the I-LDA approach. The ORL database contains 400 face images of 40 distinct subjects. Ten images were taken of each subject, with variations in facial expression, facial details and pose, but few illumination variations. The images have 256 gray levels and a resolution of 112 × 92. Fig. 2 illustrates some example images used in our experiment.
Fig. 2. Some example face images in ORL face databases.
The effect of the I-LDA subspace is first illustrated in Fig. 3, where the first two most significant features of each image extracted by D-LDA and I-LDA, respectively, are visualized. One can see from this figure that the separability of subjects is greatly improved in the I-LDA-based subspace.
Fig. 3. The distribution of 50 face images of five subjects (classes) selected from the ORL database in D-LDA (Left) and I-LDA (Right) subspaces.
We also compared the performance of five holistic-based face recognition methods, including the proposed I-LDA method, the D-LDA method, the EFM method, the famous Eigenfaces and Fisherfaces. Note that since in this paper we focus only on feature extraction techniques, a simple classifier, i.e., the nearest neighbor (to the mean) classifier with the Euclidean similarity (distance) measure, is used for classification. Fig. 4(a) shows the classification rate curves of the five methods with respect to the dimensionality of the features when 5 face images per person are selected randomly for training. The proposed method outperforms the other four methods.
In particular, our method achieves 94.8% recognition accuracy while only 27 features are used. The classification rate curves of the five methods are also shown in Fig. 4(b) as functions of the number of training samples per person. One can see from this figure that our proposed method also performs best among the five methods. The Eigenfaces outperform the remaining three methods when there are only 2 training samples per person, because of the existence of the SSS problem.
Fig. 4. Comparative results on ORL face database.
The final series of experiments verifies that the proposed method can improve the classification rate of one or several appointed classes. In the normal case (5 training samples per person, 39 features, correct coefficients $k_i = 1$, i = 1,...,40), the classification accuracy for the 40-th subject in the ORL database is 44%, while the overall classification accuracy is 93.9%. If the correct coefficients $k_j$ (j = 1,...,39) are set to 4 and $k_{40}$ is set to 5, the classification accuracy for the 40-th subject rises to 76%, while the overall classification accuracy drops to 84.6%. That is, the improvement of the classification rate of one or several appointed classes comes at the cost of a degradation of the classification rate of the remaining classes.
5 Conclusions
Feature extraction is a key step for holistic-based face recognition approaches. In this paper, an LDA extension technique, called improved-LDA (I-LDA), is proposed for face recognition. The proposed method, which combines the strengths of the D-LDA and W-LDA approaches while at the same time overcoming their disadvantages and limitations, can effectively find the significant discriminatory features for face
recognition. In particular, the I-LDA approach can also improve the classification rate of one or several appointed classes. Experiments on the ORL and Yale face databases show that the proposed approach is an effective method for face recognition. Additionally, I-LDA can also be used as an alternative to LDA for high-dimensional complex data consisting of many classes, such as face images.
References
1. Chellappa, R., Wilson, C.L., Sirohey, S.: Human and machine recognition of faces: a survey. Proc. IEEE, Vol. 83 (1995) 705–740
2. Belhumeur, P., Hespanha, J., Kriegman, D.: Eigenfaces vs. fisherfaces: recognition using class specific linear projection. IEEE Trans. Patt. Anal. Mach. Intell., Vol. 19 (1997) 711–720
3. Turk, M., Pentland, A.: Eigenfaces for recognition. J. Cognitive Neurosci., Vol. 3 (1991) 71–86
4. Jain, A.K., Duin, R., Mao, J.: Statistical pattern recognition: a review. IEEE Trans. Patt. Anal. Mach. Intell., Vol. 22 (2000) 4–37
5. Chen, L.F., Mark Liao, H.Y., Ko, M.T., Lin, J.C., Yu, G.J.: A new LDA-based face recognition system which can solve the small sample size problem. Pattern Recognition, Vol. 33 (2000) 1713–1726
6. Yu, H., Yang, J.: A direct LDA algorithm for high-dimensional data with application to face recognition. Pattern Recognition, Vol. 34 (2001) 2067–2070
7. Liu, C., Wechsler, H.: Gabor feature based classification using the enhanced Fisher linear discriminant model for face recognition. IEEE Trans. Image Processing, Vol. 11 (2002) 467–476
8. Liu, K., Cheng, Y.Q., Yang, J.Y., Liu, X.: An efficient algorithm for Foley–Sammon optimal set of discriminant vectors by algebraic method. Int. J. Pattern Recog. Artificial Intell., Vol. 6 (1992) 817–829
9. Loog, M., Duin, R.P.W., Haeb-Umbach, R.: Multiclass linear dimension reduction by weighted pairwise Fisher criteria. IEEE Trans. Pattern Anal. Machine Intell., Vol. 23 (2001) 762–766
Analysis and Recognition of Facial Expression Based on Point-Wise Motion Energy Hanhoon Park and Jong-Il Park Division of Electrical and Computer Engineering, Hanyang University, Haengdang-dong 17, Seondong-gu, Seoul, Korea
[email protected],
[email protected]
Abstract. Automatic estimation of facial expression is an important step in enhancing the capability of human-machine interfaces. In this research, we present a novel method that analyses and recognizes facial expression based on point-wise motion energy. The proposed method is simple because we exploit a few motion energy values, which are acquired by an intensity-based thresholding and counting algorithm. The method consists of two steps: analysis and recognition. At the analysis step, we compute the motion energies of facial features and compare them with each other to figure out the normative properties of each expression, extracting the dominant facial features related to each expression. At the recognition step, we perform rule-based facial expression recognition on arbitrary images using the results of the analysis. We apply the proposed method to the JAFFE database and verify its feasibility. In addition, we implement a real-time system that recognizes facial expression well under weakly-controlled environments.
1 Introduction
Facial expressions convey non-verbal cues, which play a key role in both interpersonal communication and human-machine interfaces. Pantic et al. reported that facial expressions have a considerable effect on a listening interlocutor; the facial expression of a speaker accounts for about 55% of the effect, while 38% is conveyed by voice intonation and 7% by the spoken words [13]. Although humans recognize facial expressions virtually without effort or delay, reliable expression recognition by machine is still a challenge. Over the last years many researchers have endeavored to find the normative qualities of facial expression [5,10,11,15,16]. Most of this work has been based on the Facial Action Coding System (FACS), which was developed for emotion coding by Ekman et al. [1], and has thus relied on accurate estimates of facial motion and detailed geometric description of facial structure. However, this is not well suited to facial expression recognition, because facial characteristics display a high degree of variability. We present a novel method for analyzing and recognizing facial expression based on the motion energy of facial features, which is simpler but more intuitive than previous approaches. Unlike the previous approaches to facial expression recognition,
our method does not exploit a complicated and time-consuming algorithm, such as 3D modeling of the human face or feature tracking and modeling, but a simple intensity-based thresholding and counting algorithm. Our method does not rely on a heuristic system such as FACS. Instead, we demonstrate that an extremely simple, biologically plausible motion energy detector can accurately analyze and recognize facial expressions. There are some methods that use motion energy in the literature. Bobick et al. used the image of motion energy for identifying human movement [14]. They defined the motion energy image (similar to the Motion History Image [12]) as the cumulative image of the regions where there is motion. Essa et al. also used motion energy for describing facial motion [16]. They defined motion energy as the amount of pixel-wise spatio-temporal variation. Our definition of motion energy is similar to that of Essa et al., but we compute the amount of region-wise variation. Thus we count the pixels having large motion in a region and define the motion energy of the region as that count. This has the effect of reducing the sensitivity to environmental variation. Details will be given in the next section. This paper is organized as follows. Our approach to analyzing and recognizing human facial expression is briefly introduced in Section 2. In Section 3, we explain in detail the system that analyses and recognizes facial expression using the proposed method. Experimental results are reported in Section 4. Conclusions are drawn in Section 5.
2 Method
The method consists of two steps: an analysis step and a recognition step. At the analysis step, we aim at analyzing the normative properties of facial expression. We compute the contribution measures, i.e. motion energies, of facial features and compare them with each other. Finally, we extract the dominant facial features related to each expression. At the recognition step, we perform rule-based facial expression recognition on arbitrary images using the analysis result. For experimental convenience, we assume the following about the input image: (1) the image is an intensity image; (2) the image is a frontal face image; (3) the lighting conditions of the scene are kept within a limited range. However, our method can also be applied to face images captured under weakly-controlled environments, as we will show in Section 4.
2.1 Facial Expression Analysis
Our analysis method assumes that the locations of the face and facial features are known. In the next section, we explain the methods used to find the face and facial features within an image. The procedure of our analysis method is as follows. First, we acquire a difference image (DI) by subtracting a non-expressed (neutral) image from an expressed image. Second, we segment the DI into rectangular sub-regions that surround facial features. Third, we compute the motion energy of the sub-regions using Eq. (1). Finally, after sorting the motion energies of the sub-regions, the facial features associated
with the regions having larger motion energy are determined to be the dominant features of the expression. Fig. 1 shows the block diagram of our method for analyzing facial expressions. Fig. 2 shows examples of the DI and MEI. Each pixel of the MEI represents the amount of motion within a rectangular region having the position of that pixel as its center. In the MEI, we do not use all the pixels but only a few pixel values at the positions of the facial features.
In Eq. (1), $R_i$ denotes the sub-region that surrounds the i-th facial feature, $E_i$ and $S_i$ denote the motion energy and the size of $R_i$, respectively, and th denotes a threshold value, which is determined according to the changing lighting conditions. In computing Eq. (1), we alleviate the differences between the intensity values of the facial features by subtracting the average value from the intensity values.
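A minimal sketch of this thresholding-and-counting computation (our own NumPy-based interpretation; the function name, the rectangle encoding and the exact normalisation are assumptions):

```python
import numpy as np

def motion_energy(diff_image, region, th):
    """Motion energy of one facial-feature sub-region of the difference image.

    diff_image : 2-D array of |expressed - neutral| intensities (the DI).
    region     : (top, left, height, width) rectangle surrounding the feature.
    th         : intensity threshold chosen for the current lighting conditions.
    Returns the fraction of pixels in the region whose deviation exceeds th.
    """
    top, left, h, w = region
    patch = diff_image[top:top + h, left:left + w].astype(float)
    patch = np.abs(patch - patch.mean())    # subtract the average value first
    return np.count_nonzero(patch > th) / patch.size
```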
Fig. 1. Block diagram of the proposed facial expression analysis method.
Fig. 2. Examples of DI and MEI. In MEI the pixel value represents motion energy within a rectangle region having the position of the pixel as the center.
2.2 Facial Expression Recognition
The procedure for recognizing facial expression is much simpler. We compute the motion energy of each facial feature using Eq. (1). If the expressed images associated with a facial expression e show large motion energy for a given feature, then an arbitrary expressed image having large motion energy for that feature is voted for the
facial expression e. The arbitrary image may thus be voted for 2 or 3 facial expressions, but usually most strongly for one particular facial expression. Finally, we determine that particular facial expression to be the expression of the arbitrary image. Fig. 3 depicts the procedure of the recognition method.
Fig. 3. Block diagram of the proposed facial expression recognition method. The inside of Block A is shown in Fig. 1.
3 Facial Expression Analysis and Recognition System
Our system consists of three parts: preprocessing, analysis, and recognition, as shown in Fig. 4. In this section we explain only the preprocessing part, because the others were explained in Section 2. In the preprocessing part, we detect the location of the human face in a scene and the facial features within the detected face region using off-the-shelf algorithms.
Fig. 4. Facial expression analysis and recognition system.
3.1 Face Detection
Detecting the location of the human face is the first step of the facial expression analysis and recognition system. We use the object detector proposed by Viola [7] and improved by Lienhart [8]. A classifier (namely a cascade of boosted classifiers working with Haar-like features) is trained with a few hundred sample views of a face. After the classifier is trained, it is applied to a region of interest in an input image. In our implementation, we use the functions included in the OpenCV library [12]. The sizes of the detected face regions are not the same for all subjects due to individual characteristics. In order to minimize the individual differences, we rescale the face regions so that they have a fixed size.
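For illustration, a minimal sketch of this detection-and-rescaling step using the OpenCV Python bindings (the cascade file name and the 100×100 target size are assumptions based on the description in Section 4):

```python
import cv2

def detect_and_rescale_face(gray, cascade_path="haarcascade_frontalface_default.xml",
                            size=(100, 100)):
    """Detect the largest frontal face in a grayscale image and rescale it."""
    detector = cv2.CascadeClassifier(cascade_path)
    faces = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=3)
    if len(faces) == 0:
        return None
    # Keep the largest detection, then normalise its size.
    x, y, w, h = max(faces, key=lambda r: r[2] * r[3])
    return cv2.resize(gray[y:y + h, x:x + w], size)
```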
3.2 Facial Feature Detection
Human facial feature detection is a significant but difficult task. Although many facial feature extraction algorithms have been proposed [2,4,6], they are complicated or time-consuming, need prior knowledge, or are applicable only to color images. In this paper, we specify the position of each facial feature using a modified version of the method proposed by Lin and Wu [3]. The algorithm computes a cost for each pixel inside the face region using a generic feature template and selects the pixels with the largest costs as facial feature points. It can be applied to an intensity (gray) image and is robust even when some sub-regions of the face exhibit low contrast. However, it is still time-consuming to apply the template to all pixels inside the face region [4]. Thus, we extract the valley regions [9] inside the face region and apply the template only to pixels inside the valley regions. Fig. 5 shows some of the valley regions.
Fig. 5. Examples of the valley regions extracted from the face region.
Fig. 6. Facial feature detection. (a) The main facial features (eyes, mouth) and (b) the other facial features (eyebrows and nose) are detected within (c) four sub-regions. The center of each circle is the exact position of the feature.
The procedure for extracting the facial features from the valley regions consists of two stages. In the first stage we divide the face region into four sub-regions as shown in Fig. 6(c) and extract the main facial features, as shown in Fig. 6(a), within each sub-region. The sub-regions are expected to contain the right eye, left eye, nose and mouth feature points, respectively. In the second stage we extract the other facial features, as shown in Fig. 6(b), based on the positions of the main facial features [3].
4 Experimental Results
In our experiments, the facial expression images from the JAFFE database [5] were used. The database includes images corresponding to six facial expressions (happiness, sadness, surprise, anger, disgust, fear), and each image is an intensity image.
The experimental process is as follows. First, we found the face regions in the input images using the face detection algorithm explained in Section 3, and they were rescaled to 100×100. We then detected the facial features within the rescaled face region using the detection method explained before. Next, we specified the rectangular regions (width = height = 20) that surround the facial features. Finally, we computed the motion energy of the rectangular regions using Eq. (1). Fig. 7 shows the process of the motion energy computation. The average of the motion energy values of the facial features computed from the images in the database is shown in Table 1. The gray cells have larger values than the others and thus indicate the dominant facial feature corresponding to each expression. Consequently, as given in Table 2, we can identify the dominant facial features by which each expression is mainly influenced. The result is consistent with our intuition.
Fig. 7. Motion energy computation. (a) Neutral and expressed image, (b) DI, and (c) motion energy value of each facial feature. In our experiment, we did not compute a whole MEI but only several pixel values at the positions related to the facial features.
Using the result of the analysis we can recognize the expression of an arbitrary facial image. The recognition is accomplished using Eq. (2). In Eq. (2), $e_i$ are the amounts of motion energy corresponding to each facial feature and $T_i$ are the threshold values. The threshold values are heuristically determined using the results given in Table 1. After extracting the facial features from an arbitrary image, we computed their motion energies. The arbitrary image was voted for by each facial feature whose motion energy e is larger than the corresponding threshold T. The overall recognition rate was approximately 70%, as shown in Table 3. This is quite accurate when taking the simplicity of the method into consideration.
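A minimal sketch of this rule-based voting (an interpretation of the description above; the dominant-feature lists, threshold values and neutral fallback are hypothetical, not taken from Tables 1 and 2):

```python
from collections import Counter

# Hypothetical dominant features per expression (cf. Table 2) and thresholds (cf. Table 1).
DOMINANT = {"surprise": ["mouth", "eyebrows"], "happiness": ["mouth"], "anger": ["eyebrows"]}
THRESHOLD = {"mouth": 0.30, "eyebrows": 0.20}

def recognize(energies):
    """energies: dict mapping a feature name to its motion energy from Eq. (1)."""
    votes = Counter()
    for expression, features in DOMINANT.items():
        for f in features:
            if energies.get(f, 0.0) > THRESHOLD[f]:
                votes[expression] += 1       # this feature votes for the expression
    return votes.most_common(1)[0][0] if votes else "neutral"

# Example: recognize({"mouth": 0.45, "eyebrows": 0.35}) -> "surprise"
# (both of its dominant features exceed their thresholds).
```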
We implemented a real-time system that recognizes facial expression from the facial images which are captured under weakly-controlled environments as shown in Fig. 8. Our system recognizes the facial expression very well unless the lighting condition or head position changes abruptly.
Fig. 8. Real-time system for recognizing facial expression.
5 Conclusions
We proposed a simple method that analyses and recognizes facial expression in real time based on point-wise motion energy. The performance of the proposed method was acceptable (70% recognition accuracy) while the procedure was extremely simple. We quantitatively analyzed the normative qualities of facial expressions and found that the result was consistent with our intuition. We then verified that the method could robustly recognize the facial expression of an arbitrary image using a simple intensity-based thresholding and counting algorithm under weakly-controlled environments. In this paper, we focused on the analysis of facial expression and thus used the simplest rule-based method at the recognition step. We expect that the performance would be enhanced if a more sophisticated recognition method were used. The result of the facial feature detection algorithm was slightly influenced by lighting conditions, e.g. the self-shadowing in the nose area. A study on removing the effect of lighting conditions would be valuable future research.
Acknowledgement. This work was supported by the research fund of Hanyang University.
References
1. Ekman, P., Friesen, W.V.: Facial Action Coding System. Consulting Psychologists Press Inc. (1978)
2. Gordan, M., Kotropoulos, C., Pitas, I.: Pseudo-automatic Lip Contour Detection Based on Edge Direction Patterns. Proc. of ISPA'01 (2001) 138–143
3. Lin, C.-H., Wu, J.-L.: Automatic Facial Feature Extraction by Genetic Algorithms. IEEE Transactions on Image Processing, Vol. 8 (1999) 834–845
4. Rizon, M., Kawaguchi, T.: Automatic Eye Detection Using Intensity and Edge Information. Proc. of TENCON'00, Vol. 2 (2000) 415–420
5. Lyons, M.J., Akamatsu, S., Kamachi, M., Gyoba, J.: Coding Facial Expressions with Gabor Wavelets. Proc. of FG'98 (1998) 200–205
6. Zhang, L., Lenders, P.: Knowledge-based Eye Detection for Human Face Recognition. Proc. of Fourth Intl. Conf. on Knowledge-Based Intelligent Engineering Systems & Allied Technologies (2000) 117–120
7. Viola, P., Jones, M.: Rapid Object Detection Using a Boosted Cascade of Simple Features. Proc. of CVPR'01, Vol. 1 (2001) 511–518
8. Lienhart, R., Maydt, J.: An Extended Set of Haar-like Features for Rapid Object Detection. Proc. of ICIP'02, Vol. 1 (2002) 900–903
9. Chow, G., Li, X.: Toward a System for Automatic Facial Feature Detection. Pattern Recognition, Vol. 26 (1993) 1739–1755
10. Cohn, J.: Automated Analysis of the Configuration and Timing of Facial Expression. In: What the Face Reveals (ed.): Basic and Applied Studies of Spontaneous Expression Using the Facial Action Coding System (FACS), Oxford University Press Series in Affective Science (2003)
11. Chibelushi, C.C., Bourel, F.: Facial Expression Recognition: A Brief Tutorial Overview. CVonline: On-Line Compendium of Computer Vision (2003)
12. Open Source Computer Vision Library. Available: http://www.intel.com/research/mrl/research/opencv
13. Pantic, M., Rothkrantz, L.J.M.: Automatic Analysis of Facial Expressions: the State of the Art. IEEE Transactions on PAMI, Vol. 22 (2000) 1424–1445
14. Bobick, A.F., Davis, J.W.: The Recognition of Human Movement Using Temporal Templates. IEEE Transactions on PAMI, Vol. 23 (2001) 257–267
15. Krinidis, S., Buciu, I., Pitas, I.: Facial Expression Analysis and Synthesis: A Survey. Proc. of HCI'03, Vol. 4 (2003) 1432–1436
16. Essa, I., Pentland, A.: Coding, Analysis, Interpretation, Recognition of Facial Expressions. IEEE Transactions on PAMI, Vol. 19 (1999) 757–763
Face Class Modeling Using Mixture of SVMs Julien Meynet, Vlad Popovici, and Jean-Philippe Thiran Signal Processing Institute, Swiss Federal Institute of Technology Lausanne, CH-1015 Lausanne, Switzerland http://itswww.epfl.ch
Abstract. We present a method for face detection which uses a new SVM structure trained in an expert manner in the eigenface space. This robust method has been introduced as a post-processing step in a real-time face detection system. The principle is to train several parallel SVMs on subsets of some initial training set and then to train a second-layer SVM on the margins of the first layer of SVMs. This approach presents a number of advantages over the classical SVM: firstly, the training time is considerably reduced; secondly, the classification performance is improved. We present some comparisons with the single SVM approach for the case of human face class modeling.
1 Introduction
Human face detection is one of the most important tasks of face analysis and can be viewed as a pre-processing step for face recognition systems. It is always important to find a precise localization of faces in order to be able to recognize them later. The difficulty resides in the fact that the face object is highly deformable and its aspect is also influenced by the environmental conditions. On the other hand, the class of objects which do not belong to the face class is large and cannot be modeled. Thus finding a model for the face class is a challenging task. In the last years, many methods have been proposed; we give a brief overview of the most significant of them. A fast face detection algorithm has been proposed by Viola and Jones [1]; it uses simple rectangular Haar-like features boosted in a cascade structure. We have used this fast approach as a pre-processing step in order to obtain a fast and robust face detection system. Then, one of the most representative approaches for the class of neural-network-based face detectors is the work reported by Rowley et al. in [2]. Their system has two major components: a face detector made of a scanning window at each scale and position, and a final decision module whose role is to arbitrate multiple detections. Sung and Poggio have developed a clustering and distribution-based system for face detection [3]. There are two main components in their system: a model
The authors thank the Swiss National Science Foundation for supporting this work through the National Center of Competence in Research on “ Interactive Multimodal Information Management (IM2)”.
of the face/non-face patterns distribution and a decision-making module. The two class distributions are each approximated by six Gaussian clusters. A naive Bayes classifier based on the local appearance and position of face patterns at different resolutions is described by Schneiderman and Kanade in [4]. The face samples are decomposed into four rectangular subregions which are then projected to a lower-dimensional space using PCA and quantized into a finite set of patterns. Osuna et al. developed a face detector based on an SVM that worked directly on the intensity patterns [5]. A brief description of the SVM is given in this paper as well. The large-scale tests they performed showed a slightly lower error rate than the system of Sung and Poggio, while running approximately 30 times faster. In [6], Popovici and Thiran proposed to model the face class using an SVM trained in eigenface space. They showed that even a very low dimensional space (compared with the original input space) suffices to capture the relevant information when used in conjunction with a powerful classifier, like a non-linear SVM. We propose here an extension of these ideas that employs a mixture of SVMs (MSVM in the following) for better capturing the face class variability. We use the analysis from [6] for choosing the input space of our classifier, but we will also extend the feature vector by adding a new term that accounts for the information lost through the PCA process. The idea of using a mixture of experts (in our case SVMs) is not new, but we will use a slightly different approach: the final decision is taken by an SVM that is trained using the margins output by the first layer of SVMs. In training this final SVM we penalize more heavily the false negative type of errors (missed faces) to favor the detection of faces. Other ways of combining the experts can be used: for example, in [7] the EM algorithm was used to train the experts. Later, [8] replaced neural network experts by SVMs but still trained each expert on the whole dataset. The use of parallel SVMs trained on subsets of a large-scale problem was studied in 2002 in [9]. However, the second layer remained a neural network. We will introduce the MSVM and justify its use both from a theoretical perspective and a more practical one. In Section 2 we will briefly review the SVM theory and then describe the MSVM approach. The MSVM will be trained on face and non-face examples pre-processed by PCA, as described in Section 2.3. Finally, in Sections 3 and 4 we present some experiments and comparisons with the classical SVM and draw some conclusions.
2 Mixtures of SVMs
2.1 An Overview of Classical SVM
Let us begin with a brief overview of the classical SVM algorithm. More information about SVMs can be found in [10], [11]. Let $\{(x_i, y_i)\}_{i=1,\dots,N}$, with $x_i \in \mathbb{R}^n$ and $y_i \in \{-1, +1\}$, be a set of examples. From a practical point of view, the problem to be solved is to find the hyperplane that
correctly separates the data while maximizing the sum of distances to the closest positive and negative points (i.e. the margin). The hyperplane is given by

$$\langle w, x \rangle + b = 0,$$

where $\langle \cdot,\cdot \rangle$ denotes the inner product operator, and the decision function is

$$f(x) = \operatorname{sgn}\big(\langle w, x \rangle + b\big).$$

In the case of linearly separable data, maximizing the margin means maximizing $2/\lVert w \rVert$ or, equivalently, minimizing $\lVert w \rVert^2$ subject to $y_i(\langle w, x_i \rangle + b) \geq 1$. Suppose now that the two classes overlap in feature space. One way to find the optimal plane is to relax the above constraints by introducing slack variables $\xi_i \geq 0$ and solving the following problem (using the 2-norm for the slack variables):

$$\min_{w,\, b,\, \xi}\ \tfrac{1}{2}\lVert w \rVert^2 + C \sum_{i=1}^{N} \xi_i^2 \quad \text{subject to} \quad y_i\big(\langle w, x_i \rangle + b\big) \geq 1 - \xi_i,$$

where C controls the weight of the classification errors ($C = \infty$ in the separable case). This problem is solved by means of the Lagrange multipliers method. Let $\alpha_i \geq 0$ be the Lagrange multipliers solving the problem above; then the separating hyperplane, as a function of the $\alpha_i$, is given by

$$f(x) = \operatorname{sgn}\Big(\sum_{i=1}^{N} \alpha_i\, y_i\, \langle x_i, x \rangle + b\Big).$$

Note that usually only a small proportion of the $\alpha_i$ are non-zero. The training vectors corresponding to $\alpha_i > 0$ are called support vectors and are the only training vectors influencing the separating boundary. In practice, however, a linear separating plane is seldom sufficient. To generalize the linear case one can project the input space into a higher-dimensional space in the hope of a better training-class separation. In the case of SVMs this is achieved by using the so-called "kernel trick". Basically, it replaces the inner product $\langle x_i, x \rangle$ with a kernel function $k(x_i, x)$. As the data vectors are involved only through these inner products, the optimization process can be carried out in the feature space directly. Some of the most used kernel functions are the polynomial kernel $k(x, y) = (\langle x, y \rangle + 1)^d$ and the Gaussian (RBF) kernel $k(x, y) = \exp(-\lVert x - y \rVert^2 / 2\sigma^2)$.
2.2 Mixture of SVMs (MSVM)
SVM techniques have been well known for a few years for many reasons, among them their generalization capabilities. However, as explained in the previous
subsection, training an SVM usually requires solving a quadratic optimization problem, whose cost also grows quadratically with the number of training examples. We know by experience that, because of the large variability of both the face and non-face classes, building a face detection system requires a large number of examples. So, in order to make the training of the SVM easier (in terms of training time), we use a parallel structure of SVMs similar to the one introduced in [9]. A first part of the dataset is split and clustered, and each cluster is used to train one SVM of the first layer. Then the remaining examples are used to train a second-layer SVM, based on the margins of the first-layer SVMs. Basically, the input space for the second-layer SVM is the space of margins generated by the first-layer SVMs. We can represent the output of such a mixture of M + 1 experts as

$$f(x) = f_{L2}\big(m(x)\big),$$
where m(x) is the vector of margins output by the M SVMs in the first layer given the input x. Assuming that we want to train M SVMs in the first layer, we will need M + 1 training sets (an additional one is used to train the second-layer SVM), see Fig. 1. We use two different approaches for generating the M + 1 subsets. One consists of a random partitioning of the original training set. The second one is more elaborate: we first randomly draw a sample that will be used for training the second layer and then we use a clustering algorithm, like k-means [12], for building the M subsets needed for training the first-layer SVMs. In both cases we train each SVM-L1-i using a cross-validation process to select the best parameters; then we use the (M + 1)-th dataset for training the second-layer SVM (SVM-L2): we let each SVM-L1-i classify the examples from this dataset and we take the margins output by the SVM-L1-i as input for SVM-L2. The margin can be seen as a measure of confidence in classifying an example, so, in some sense, the second-layer SVM learns a non-linear function that depends on the input vector and which assembles the confidences of each individual expert. From a practical point of view, we have decomposed a problem whose cost grows quadratically in N into M + 1 problems of much smaller size. As N >> M this decomposition is clearly advantageous, and has the potential of being implemented in parallel, reducing the training time even more. Another issue that should be mentioned here is related to the robustness of the final classifier. In the case of a single SVM, if the training set contains outliers or some examples heavily affected by noise, its performance can be degraded. However, the chances of suffering from such examples are smaller in the case of the MSVM.
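A minimal sketch of this two-layer training scheme using scikit-learn (our own illustration; the parameter values, the RBF kernel choice and the class-weighting used to penalize missed faces are assumptions consistent with the description above):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

def train_msvm(faces, non_faces, faces_l2, non_faces_l2, n_experts=5):
    """Mixture of SVMs: first-layer experts on clusters of the face set,
    second-layer SVM trained on the margins produced by the experts."""
    # Cluster the face examples; each cluster plus the non-faces trains one expert.
    clusters = KMeans(n_clusters=n_experts, n_init=10).fit_predict(faces)
    experts = []
    for c in range(n_experts):
        X = np.vstack([faces[clusters == c], non_faces])
        y = np.hstack([np.ones(np.sum(clusters == c)), -np.ones(len(non_faces))])
        experts.append(SVC(kernel="rbf", C=10.0, gamma="scale").fit(X, y))

    # Margins (signed distances to each expert's boundary) feed the second layer.
    X2 = np.vstack([faces_l2, non_faces_l2])
    y2 = np.hstack([np.ones(len(faces_l2)), -np.ones(len(non_faces_l2))])
    margins = np.column_stack([e.decision_function(X2) for e in experts])

    # Penalizing missed faces more heavily favors detection, as described above.
    svm_l2 = SVC(kernel="rbf", C=10.0, gamma="scale", class_weight={1: 2.0})
    return experts, svm_l2.fit(margins, y2)

def msvm_predict(experts, svm_l2, X):
    m = np.column_stack([e.decision_function(X) for e in experts])
    return svm_l2.predict(m)
```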
2.3 Construction of the Eigenfaces
As we use a large number of examples, we use Principal Component Analysis (PCA) to decrease the dimensionality of the image space. We first recall the definition of PCA and then discuss some possible improvements.
Fig. 1. Training of the SVMs of the first and second layers.
Principal Component Analysis (PCA) and Eigenfaces. Let be a set of vectors and consider the following linear model for representing them
where is a matrix, and For a given the PCA can be defined ([13]) as the transformation whose column vectors called principal axes, are those orthonormal axes onto which the retained variance under projection is maximal. It can be shown that the vectors are given by the dominant eigenvectors of the sample covariance matrix3 such that and where is the sample mean. The vector is the representation of the observed vector The projection defined by PCA is optimal in the sense that amongst the subspaces, the one defined by the columns of minimizes the reconstruction error where Now let us view an image as a vector in space by considering its pixels in lexicographic order. Then the PCA method can be applied to images as well, and in the case of face images the principal directions are called eigenfaces [14], [15]. Some details about the estimation of the eigenfaces space dimensionality such as classification in eigenfaces space using SVMs are shown in [6]. Distance From Feature Space (DFFS). Traditionally, the distance between a given image and the class of faces has been decomposed in two orthogonal components: the distance in feature space (corresponding to the projection onto the lower dimensional space) and the distance from feature space (DFFS) (accounting for the reconstruction error).
³ We denote with a prime symbol the transpose of a matrix or a vector.
Given this, and considering that the DFFS still contains some useful information for classification, we can improve the discrimination power by appending the value of the DFFS to the projection vector. Thus, keeping the number of eigenvectors needed to retain 85% of the total variance, we use feature vectors formed by the projection onto the eigenfaces space followed by the DFFS value to perform the classification.
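As an illustration of this feature construction, the sketch below builds the eigenfaces basis with scikit-learn and appends the DFFS (reconstruction error) to each projection; the 85% variance threshold follows the text, while the function and variable names are ours.

```python
import numpy as np
from sklearn.decomposition import PCA

def fit_eigenfaces(train_images, retained_variance=0.85):
    """train_images: (n_samples, h*w) array of vectorized face crops."""
    pca = PCA(n_components=retained_variance)   # keep 85% of the total variance
    pca.fit(train_images)
    return pca

def project_with_dffs(pca, images):
    """Return [projection, DFFS] feature vectors for a batch of images."""
    z = pca.transform(images)                       # projection onto the eigenfaces space
    recon = pca.inverse_transform(z)                # back-projection
    dffs = np.linalg.norm(images - recon, axis=1)   # reconstruction error (DFFS)
    return np.hstack([z, dffs[:, None]])
```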
3 Experiments and Results
Our experiments have been based on images from the BANCA [16] and XM2VTS [17] databases for the faces, whereas the non-face examples were chosen by bootstrapping on randomly selected images. In order to test the accuracy and the validity of the method, we used the following datasets: a training set made of 8256 faces and 14000 non-face examples, all images of fixed size 20 × 15 pixels, and a validation set of 7822 faces and 900000 non-faces of the same size. We first tried to find a coherent PCA decomposition before training the SVMs. The PCA reduces the dimensionality of the input space, and the eigenfaces also proved to be more robust features in real-world applications than the raw pixel values. We first estimated the dimensionality of the eigenfaces space needed to keep 85% of the total variation. For this we estimated the number of examples from which the eigenfaces space has a stable dimensionality for keeping 85% of the total variation. We then performed the PCA decomposition on a randomly selected part of the training set and, from the 300-dimensional input space, kept only 17 eigenfaces. As explained earlier, the vector used for the classification task is obtained by appending the DFFS value to the projection onto the eigenfaces space. The face training set was then split into two subsets. The first part, containing 5000 examples, was split into 5 subsets, either by clustering or by random sampling. We trained the SVM-L1-i on these 5 subsets, each combined with 2000 negative examples, and the remaining subset (3000 faces and 4000 non-faces) was passed through all the trained SVM-L1-i. The output margins were used to train the second-layer SVM. Table 1 shows the classification results on the validation set for each SVM. Using random sampling for generating the training sets for the first layer has the advantage of reducing the importance of outliers or unusual examples, but leads to SVMs that need more support vectors for good performance. On the other hand, using k-means clustering leads to SVMs that perform like experts on their own domain, but whose combined expertise should cover the full domain. It is interesting to see that the MSVM has better generalization capabilities than a single SVM trained on the initial dataset. This result shows that, as explained in Section 2, the MSVM does not only give improvements in terms of training time but also in terms of classification performance. We can also notice the importance of SVM-L2: the TER (total error rate) is improved with respect to
a single SVM. This is particularly interesting for face detection, as it improves the true positive rate (even if the false positive rate is degraded). Recall that in face detection we often want to detect a maximum number of faces, even if some non-face examples are misclassified. Another advantage of this method compared to the single SVM trained on the complete dataset is that the total number of support vectors (last column in Table 2) is much smaller in the case of the MSVM. This emphasizes the gain in time and computational complexity offered by the MSVM.
4 Conclusions
In this paper we presented a method for face class modeling using mixtures of SVMs. This approach is an extension of the SVM technique which allows a better use of particularly large datasets. We have used this mixture-of-experts approach in the context of face detection, using a PCA decomposition and then appending the DFFS to the features in order to decrease the information loss introduced by the PCA projection. We have proposed a mixture of SVMs made of several SVMs in a first layer, trained on independent subsets of the initial dataset, and a second layer trained on the margins predicted by the first-layer SVMs on another independent subset. It has been shown that this structure allows a significant improvement over a single SVM trained on the complete database. On the one hand, the training time is largely reduced because of the parallel structure and the splitting of the original set; on the other hand, the discrimination capabilities are improved because the influence of noise and outliers possibly present in the dataset is reduced. In order to have a structure better adapted to
the datasets, we are now working on more specialized experts, for example by using a clustering in eigenfaces space based on a more appropriate metric.
References
1. P. Viola and M. Jones, "Rapid object detection using a boosted cascade of simple features," in Proceedings IEEE Conf. on Computer Vision and Pattern Recognition, 2001.
2. H. A. Rowley, S. Baluja, and T. Kanade, "Human face detection in visual scenes," in Advances in Neural Information Processing Systems (D. S. Touretzky, M. C. Mozer, and M. E. Hasselmo, eds.), vol. 8, pp. 875–881, The MIT Press, 1996.
3. K. Sung and T. Poggio, "Example-based learning for view-based human face detection," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 1, pp. 39–51, 1998.
4. H. Schneiderman and T. Kanade, "Probabilistic modeling of local appearance and spatial relationship for object recognition," in Proceedings of Computer Vision and Pattern Recognition, pp. 45–51, 1998.
5. E. Osuna, R. Freund, and F. Girosi, "Training support vector machines: an application to face detection," in Proceedings of Computer Vision and Pattern Recognition, 1997.
6. V. Popovici and J.-P. Thiran, "Face detection using SVM trained in eigenfaces space," in Proceedings of the 4th International Conference on Audio- and Video-Based Biometric Person Authentication, pp. 925–928, 2003.
7. R. A. Jacobs, M. I. Jordan, S. J. Nowlan, and G. E. Hinton, "Adaptive mixtures of local experts," Neural Computation, vol. 3, no. 1, pp. 79–87, 1991.
8. J. T. Kwok, "Support vector mixture for classification and regression problems," in Proceedings of the International Conference on Pattern Recognition (ICPR), pp. 255–258, 1998.
9. R. Collobert, S. Bengio, and Y. Bengio, "A parallel mixture of SVMs for very large scale problems," 2002.
10. V. Vapnik, The Nature of Statistical Learning Theory. Springer Verlag, 1995.
11. N. Cristianini and J. Shawe-Taylor, An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods. Cambridge University Press, 2000.
12. C. Darken and J. Moody, "Fast adaptive k-means clustering: Some empirical results," 1990.
13. H. Hotelling, "Analysis of a complex of statistical variables into principal components," Journal of Educational Psychology, no. 24, pp. 417–441, 498–520, 1933.
14. L. Sirovich and M. Kirby, "Low-dimensional procedure for the characterization of human faces," Journal of the Optical Society of America A, vol. 4, pp. 519–524, 1987.
15. M. Turk and A. Pentland, "Eigenfaces for recognition," Journal of Cognitive Neuroscience, vol. 3, no. 1, pp. 71–86, 1991.
16. S. Bengio, F. Bimbot, J. Mariéthoz, V. Popovici, F. Porée, E. Bailly-Baillière, G. Matas, and B. Ruiz, "Experimental protocol on the BANCA database," IDIAP-RR 05, IDIAP, 2002.
17. K. Messer, J. Matas, J. Kittler, J. Luettin, and G. Maitre, "XM2VTSDB: The extended M2VTS database," in Second International Conference on Audio- and Video-based Biometric Person Authentication, 1999.
Comparing Robustness of Two-Dimensional PCA and Eigenfaces for Face Recognition

Muriel Visani, Christophe Garcia, and Christophe Laurent
France Telecom R&D - DIH/HDM, 4, rue du Clos Courtel, 35512 Cesson-Sévigné Cedex, France
[email protected]
Abstract. In this paper, we aim at evaluating the robustness of 2DPCA for face recognition, and comparing it with the classical eigenfaces method. For most applications, a sensory gap exists between the images collected and those used for training. Consequently, methods based upon statistical projection need several preprocessing steps: face detection and segmentation, rotation, rescaling, noise removal, illumination correction, etc... This paper determines, for each preprocessing step, the minimum accuracy required in order to allow successful face recognition with 2DPCA and compares it with the eigenfaces method. A series of experiments was conducted on a subset of the FERET database and digitally-altered versions of this subset. The tolerances of both methods to eight different artifacts were evaluated and compared. The experimental results show that 2D-PCA is significantly more robust to a wide range of preprocessing artifacts than the eigenfaces method.
1 Introduction
During the last decade, automatic recognition of human faces has grown into a key technology, especially in the field of multimedia indexing and video surveillance. In this context, the views of the face to recognize can differ from the training set in the exact location of the face, the head pose, the distance to the camera, the quality of the images, the lighting conditions and partial face occlusions due to the presence of accessories such as eyeglasses or a scarf. These different factors can affect the matching process. Many face recognition algorithms [1,2,3] have been developed. Most of these techniques need accurate face detection / localization [4,5] and normalization [6]. This last step ensures that the face to recognize is aligned with those used for training. Normalization is a critical step that can generate many artifacts. Statistical projection methods, such as the eigenfaces method [1], are among the most widely used for face recognition. The eigenfaces method is based on Principal Component Analysis (PCA), and has shown good performance on various databases. Very recently, Yang et al. [3] have introduced the concept of Two-Dimensional PCA (2D-PCA), and have shown that it provides better results than the eigenfaces method on three well-known databases. Lemieux and Parizeau [7]
have evaluated the robustness of the eigenfaces method to normalization artifacts on the AR Face Database [8]. They have shown that the eigenfaces method is robust up to a certain point over a representative range of errors; past this point, the performance can decrease dramatically. They have also shown that the eigenfaces method deals poorly with common artifacts such as translation errors: indeed, a misalignment of 5% can reduce the recognition rates by 40%. The aim of this paper is to evaluate the robustness of 2D-PCA and compare it to the robustness of the eigenfaces method on a subset of the FERET¹ [9] face database. The following parameters are tested: image rotation, scaling, vertical and horizontal translations, Gaussian blurring, addition of white noise and partial occlusions of the face. The first four parameters model the effects of an inaccurate localization of the facial features, while Gaussian blurring and addition of white noise simulate respectively poor resolution and low quality images. The paper is organized as follows. Section 2 details the 2D-PCA method. Section 3 describes our experimental protocol. Experimental results and in-depth analysis are given in Section 4. Section 5 concludes this paper.
2 Brief Description of Two-Dimensional PCA
While the eigenfaces method [1] is a baseline technique, Two-Dimensional PCA [3] is a very recent approach that we propose to describe in this section. The model is constructed from a training set containing N images. While the eigenfaces approach considers an image with h rows and w columns as a vector of size h·w (by concatenating its rows of pixels), 2D-PCA keeps the 2D structure of an image by considering it as a matrix A of pixels, with h rows and w columns. The goal is to obtain a set of d projection vectors p_1, ..., p_d of size w, gathered in a matrix P, so that the projection of the training set on P best explains the scatter of the training set. These vectors will be referred to as 2D-components in the following. The projected matrix of the i-th image A_i of the training set on P is B_i = A_i P, where P is the w × d matrix whose columns are the 2D-components. Yang et al. [3] introduced the maximization criterion J(P) = tr(S_B), where S_B is a generalized covariance matrix of the projected image matrices B_i. Yang shows that the criterion equals J(P) = tr(P^T S P), where S is a generalized covariance matrix of the image matrices,

S = (1/N) Σ_{i=1..N} (A_i − Ā)^T (A_i − Ā),

where Ā is the mean matrix of all the images of the training set. It can be shown [3] that the vectors maximizing the criterion J(P) are the eigenvectors of S with the largest eigenvalues.
¹ Portions of the research in this paper use the FERET database of facial images collected under the FERET program.
Face images are compared after projection on P. Yang et al. proposed the following distance between two projected face images B_i = A_i P and B_j = A_j P:

d(B_i, B_j) = Σ_{k=1..d} || y_k^(i) − y_k^(j) ||,

where || · || denotes the Euclidean norm and y_k = A p_k is the projected vector of image matrix A on the k-th projection vector p_k.
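A minimal numpy sketch of 2D-PCA training and of the distance above, under the assumption that images are supplied as equally sized matrices; the number of 2D-components d is a free parameter.

```python
import numpy as np

def fit_2dpca(images, d):
    """images: (N, h, w) array; returns the w x d matrix P of 2D-components and the mean image."""
    mean = images.mean(axis=0)                        # mean image matrix
    S = np.zeros((images.shape[2], images.shape[2]))  # generalized covariance matrix (w x w)
    for A in images:
        diff = A - mean
        S += diff.T @ diff
    S /= len(images)
    vals, vecs = np.linalg.eigh(S)                    # eigen-decomposition of S
    P = vecs[:, np.argsort(vals)[::-1][:d]]           # d dominant eigenvectors
    return P, mean

def project(A, P):
    return A @ P                                      # projected matrix B = A P

def distance(Bi, Bj):
    """Sum of Euclidean norms of corresponding projected column vectors."""
    return np.linalg.norm(Bi - Bj, axis=0).sum()
```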
3 Description of the Experiments
In this section, we propose to compare the robustness of 2D-PCA, using the distance defined in the previous section, with the robustness of the classical eigenfaces approach using the usual Euclidean distance between projections. Yang et al. [3] have proven that image feature extraction is computationally more efficient using 2D-PCA than using the eigenfaces method. They have also shown that 2D-PCA gives better recognition rates than the eigenfaces method in the presence of variations over time, variations in the sample size, facial expressions, lighting conditions and pose. They experimented on three correctly normalized face databases excluding FERET. In order to study independently the effects of inaccuracies in the normalization steps, we performed our experiments on a subset of the FERET database, and digitally-modified versions of this subset. The subset used for training contains 200 pictures of 200 persons (one view per person). Most of the subjects have a neutral facial expression. None of them wear eyeglasses. An example is given in Fig. 1(a). For each image, the positions of the eyes are known; they are used to perform face normalization, in five steps:
1. detecting and localizing the eyes in the image;
2. rotating the image so that the eyes are horizontally aligned;
3. scaling the image so that the distance between the eyes is set to 70 pixels;
4. cropping the image to a size of 130 pixels wide by 150 pixels high;
5. equalizing the histogram of the face image.
A successful normalization is illustrated in Fig. 1(b). To simulate the effects of disturbing events, we have defined 8 parameters, illustrated in Fig. 1(c-j). The first four parameters simulate the effects of imprecise eye localization.
Vertical and horizontal translations: when cropping the images, an inaccurate feature detection can lead to translations of the face in the image. Horizontal translation varies from -30 to 30 pixels (23% of the total width), positive values corresponding to translations to the right. Vertical translation varies from -19 to 19 pixels (12.7% of the total height), positive values corresponding to translations to the top.
Rotation: a central rotation whose center is located exactly at the middle of the eyes is applied after face normalization. The rotation angle varies from 1 to 19 degrees, clockwise.
Scaling: the difference between the observed inter-eye distance and the target distance (i.e. 70 pixels) is varied from -20% to 20%; positive values correspond to zooming in and negative values to zooming out.
Fig. 1. (a) Original image (FERET database). (b) Correctly normalized image (size 150 × 130). (c) Horizontal translation (22 pixels). (d) Vertical translation (4 pixels). (e) Rotation (8 degrees clockwise). (f) Scaling (-7%). (g) Blurring (h) Additive Gaussian white noise (i) Scarf (47 pixels). (j) Glasses
In an uncontrolled environment, depending on the distance between the camera and the subject, the resolution of the face image to recognize can be much lower than the resolution of the training images. One solution is to digitally zoom on the corresponding face. Zooming results in an interpolation of the missing pixels, which blurs the image; this phenomenon is simulated by the following parameter.
Blurring: the image is convolved with a Gaussian filter, whose standard deviation is varied from 0.5 to 9.5.
Images acquired through real cameras are always contaminated by various noise sources, but if the systematic parts of the measurement error are compensated for, the error can be assumed to be additive Gaussian white noise, simulated by the following parameter.
White noise: Gaussian white noise is added to the whole face image; its standard deviation is varied from 1 to 90.
Let us finally consider the effects of occlusions. Some of the most usual occlusions are due to the presence of eyeglasses or of a scarf hiding the inferior part of the face. The glasses can be more or less dark, and the scarf can be more or less raised on the face. The following two parameters simulate these occlusions (a sketch of how such artifacts can be generated digitally is given after this list).
Scarf: a black strip is added to the face image. It covers all the surface of the image from the bottom to a given height, varied from 1 to 80 pixels (more than 53% of the total height).
Glasses: two black ellipses of width 3 pixels, whose centers are the centers of the eye pupils and whose axial lengths are 28 and 18 pixels, are added to the face image. They are connected by a black strip of size 3×17 pixels. Each pixel inside one of these ellipses is darkened by an amount that depends on a parameter varied from 0 to 1 and on the mean of all the pixels of the original image; increasing the parameter darkens the interior of the ellipses, and near its maximum value the glasses are completely black.
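For illustration, the following sketch shows how some of these artifacts could be generated (Gaussian blurring, additive white noise and the scarf occlusion); the parameter values are placeholders taken from the ranges quoted above, and the glasses artifact, which requires ellipse rasterization, is omitted for brevity.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def blur(img, sigma=2.0):
    """Convolve the face image with a Gaussian filter of the given standard deviation."""
    return gaussian_filter(img.astype(float), sigma=sigma)

def add_white_noise(img, sigma=30.0, seed=0):
    """Add zero-mean Gaussian white noise to the whole face image."""
    rng = np.random.default_rng(seed)
    noisy = img + rng.normal(0.0, sigma, img.shape)
    return np.clip(noisy, 0, 255)

def add_scarf(img, height=47):
    """Black strip covering the image from the bottom up to `height` pixels."""
    out = img.copy()
    out[-height:, :] = 0
    return out
```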
Our aim is to study the effects of each of these parameters separately. Therefore, for a given experiment, only one parameter is tuned. The training set is the subset of the FERET database previously described, correctly normalized thanks to precise eye positions. Each test set corresponds to a fixed value of a given parameter applied to the training set, and therefore contains 200 digitally modified images of the training set, of the same size 150 × 130 pixels.
4 Experimental Results
In order to obtain the best performances, for both techniques and for each parameter, we first studied the number of projection vectors providing the best recognition rates. The projection vectors are sorted by their associated eigenvalue in descending order, and the first ones are selected. In Fig. 2-3, the number of selected projection vectors is systematically given after the name of the algorithm used (e.g. 2D-PCA(6) means that 6 2D-components have been selected to implement the 2D-PCA algorithm). Even though most of our experiments highlighted an optimal number of projection vectors, it can be noticed that, for some parameters, recognition rates grow with the number of projection vectors. This phenomenon, often observed with eigenfaces, is illustrated in Fig. 2(a). Concerning horizontal translations, the best results were obtained with only one 2D-component and decreased dramatically when using more 2D-components. This phenomenon, illustrated in Fig. 2(b), is very interesting and opens the way to a normalization process using 2D-PCA. To evaluate the robustness of both methods, we studied the variations of the recognition rates when each parameter is tuned independently. From Fig. 3, we can extract the tolerance ranges to each parameter, given in Table 1. The tolerance range to a parameter is the variation range of this parameter within which the recognition rates are greater than 95%. Concerning horizontal translations (see Fig. 3(a)), 2D-PCA is much more robust than the eigenfaces method. The tolerance range for 2D-PCA with only the first 2D-component is [-20, 22] (about 17% of the total width) and only [-6, 6] (4.6%) for 70 eigenfaces. When adding more 2D-components (see Fig. 2(b)), the recognition rates decrease but are still better than the recognition rates provided by the eigenfaces method. Fig. 3(b) shows that, for vertical translations, the optimal number of projection vectors is 90 eigenfaces against only 13 2D-components; however 2D-PCA achieves much greater recognition rates. 2D-PCA's tolerance range is [-4, 4] (2.7% of the total height) against [-3, 3] (2%) for the eigenfaces method. The tolerance range to rotation (see Fig. 3(c)) for 2D-PCA is [0, 8] with only 4 2D-components and [0, 6] with 90 eigenfaces. The recognition rates of 2D-PCA are significantly greater than those of the eigenfaces method with the rotation angle varying from 1 to 19 degrees. When studying the robustness to scaling (see Fig. 3(d)), we can notice that both methods appear to be as robust to zooming in as to zooming out. 2D-PCA is more robust to scaling than the eigenfaces method. Using only 13 2D-components provides better results than using 90 eigenfaces: 2D-PCA has a tolerance of ±7% to scaling, while the eigenfaces method's tolerance is ±6%.
Fig. 2. (a) Effects of scaling on PCA. Recognition rates grow with the number of selected eigenfaces, until it reaches 90. (b) Effects of horizontal translation on 2D-PCA. The best recognition rates are obtained with the first 2D-component only. The recognition rates decrease when more 2D-components are added.
Fig. 3(e) shows that 2D-PCA is much more robust to blurring, with a tolerance of 5.5, than the eigenfaces method, for which the tolerance is only 4. Fig. 3(f) shows that both techniques are very robust to additive white noise. Recognition rates for both techniques are very close to 100% with the noise standard deviation varying from 0 to 90, which corresponds to a strong additive noise, as shown in Fig. 1(h). From Fig. 3(h-i) we can conclude that 2D-PCA is significantly more robust to partial occlusions than the eigenfaces approach. While 2D-PCA tolerates a 47-pixel scarf, the eigenfaces only tolerate a 22-pixel occlusion (an improvement of about 114%). Concerning glasses, from 9 2D-components to 20, the recognition rates are 100% when the darkening parameter is varied from 0.05 to 1, while the tolerance range of the eigenfaces method is at most 0.15, with the optimal number of 90 eigenfaces.
Fig. 3. Compared recognition rates of 2D-PCA and eigenfaces when each parameter is tuned independently.
5 Conclusion
Two-Dimensional PCA has proven to be efficient for the task of face recognition, as well as computationally more efficient than the eigenfaces method [3]. However, like every statistical projection technique, it requires several preprocessing steps that can generate various artifacts. Our aim was to determine the minimum accuracy required for 2D-PCA to provide efficient recognition. The robustness of 2D-PCA was compared to the robustness of the classical eigenfaces method on a subset of the well-known FERET database. Experimental results have shown that 2D-PCA is more robust than the eigenfaces method over a wide range of normalization artifacts, notably translations, rotation of the face in the plane of the image, scaling, blurring and partial occlusions of the face. Some of our very recent experiments tend to show that 2D-PCA is also more robust to in-depth rotations than the eigenfaces method (recognition rates are improved by about 9% up to a 30-degree rotation). Therefore, assuming that the accuracy of the preprocessing algorithm is within the tolerance ranges given in this paper, 2D-PCA can be applied successfully to face recognition in an unconstrained environment such as video indexing or video surveillance.
References 1. Turk, M., Pentland, A.: Eigenfaces for recognition. Journal of Cognitive Neuroscience 3 (March 1991) 71–86. 2. Zhao, W., Chellappa, R., Phillips, P.J., Rosenfeld, A.: Face recognition: A literature survey. ACM Computing Survey, 35(4) (2003) 399–458. 3. Yang, J., Zhang, D., Frangi, A.F.: Two-Dimensional PCA: A New Approach to Appearance-Based Face Representation and Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 26(1) (January 2004) 131–137. 4. Yang, M.H., Kriegman, D., Ahuja, N.: Detecting Faces in Images: A Survey. IEEE Transactions on Pattern Analysis and Machine Intelligence 24(1) (2002) 34–58. 5. Garcia, C., Delakis, M.: Convolutional Face Finder: A Neural Architecture for Fast and Robust Face Detection. To appear in IEEE Transaction of Pattern Analysis and Machine Intelligence (2004). 6. Reisfeld, D., Yeshurun, Y.: Preprocessing of Face Images: Detection of Features and Pose Normalization. Computer Vision and Image Understanding 71(3) (September 1998) 413–430. 7. Lemieux, A., Parizeau, M.: Experiments on Eigenfaces Robustness. Proc. International Conf. on Pattern Recognition (ICPR) (2002). 8. Martinez, A.M., Benavente, R.: The AR Face Database. CVC Technical Report 24 (June 1998). 9. Phillips, P.J., Wechsler, H., Huang, J., Rauss, P.: The FERET Database and Evaluation Procedure for Face Recognition Algorithms. Image and Vision Computing 16(5) (1998) 295–306.
Useful Computer Vision Techniques for Human-Robot Interaction

O. Deniz, A. Falcon, J. Mendez, and M. Castrillon
Universidad de Las Palmas de Gran Canaria, Departamento de Informática y Sistemas, Edificio de Informática y Matemáticas, Campus de Tafira, 35017, Las Palmas, Spain
{odeniz,afalcon,jmendez,mcastrillon}@dis.ulpgc.es
Abstract. This paper describes some simple but useful computer vision techniques for human-robot interaction. First, an omnidirectional camera setting is described that can detect people in the surroundings of the robot, giving their angular positions and a rough estimate of the distance. The device can be easily built with inexpensive components. Second, we comment on a color-based face detection technique that can alleviate skin-color false positives. Third, a simple head nod and shake detector is described, suitable for detecting affirmative/negative, approval/disapproval, and understanding/disbelief head gestures.
1 Introduction
In recent years there has been a surge of interest in a topic called social robotics. As used here, social robotics does not relate to groups of robots that try to complete tasks together. For a group of robots, communication is simple: they can use whatever complex binary protocol to "socialize" with their partners. For us, the adjective social refers to humans. In principle, the implications of this are much wider than the case of groups of robots. Socializing with humans is definitely much harder, not least because robots and humans do not share a common language nor perceive the world (and hence each other) in the same way. Many researchers working on this topic use other names like human-robot interaction or perceptual user interfaces. However, as pointed out in [1], we have to distinguish between conventional human-robot interaction (such as that used in teleoperation scenarios or in friendly user interfaces) and socially interactive robots. In the latter, the common underlying assumption is that humans prefer to interact with robots in the same way that they interact with other people. Human-robot interaction crucially depends on the perceptual abilities of the robot. Ideal interaction sessions would make use of non-invasive perception techniques, like hands-free voice recognition or computer vision. Hands-free voice recognition is a topic that is still under research, the most attractive approaches being the combination of audio and video information [2] and microphone arrays [3]. Computer vision is no doubt the most useful modality; its non-invasiveness is its most important advantage. In this paper, three computer vision techniques for human-robot interaction are described. All of them have been used in a prototype social robot [4]. The robot is an animal-like head that stands on a table and has the goal of interacting with people.
2 Omnidirectional Vision
Most social robots built to date use two types of cameras: a wide field-of-view camera (around 70 deg), and a foveal camera. The omnidirectional camera shown in Figure 1 gives the robot a 180 deg field of view, which is similar to that of humans. The camera is to be placed in front of the robot. The device is made up of a low-cost USB webcam, construction parts and a curved metallic surface looking upwards, in this case a kitchen ladle.
Fig. 1. Omnidirectional camera.
As for the software, the first step is to discard part of the image, as we want to watch only the frontal zone, covering 180 degrees from side to side. Thus, the input image is masked in order to use only the upper half of an ellipse, which is the shape of the mirror as seen from the position of the camera. A background model is obtained as the mean value of a number of frames taken when no person is present in the room. After that, the background-subtracted input images are thresholded and the morphological closing operator is applied. From the obtained image, connected components are localized and their area is estimated. Also, for each connected component, the Euclidean distance from the nearest point of the component to the center of the ellipse is estimated, as well as the angle of the center of mass of the component with respect to the center of the ellipse and its largest axis. Note that, as we are using an ellipse instead of a circle, the nearness measure obtained (the Euclidean distance) is not constant for a fixed real range to the camera, though it works well as an approximation, see Figure 2. The background model M is updated with each input frame I through a per-pixel updating function U, which is computed from the difference image D between the input frame and the current model.
Fig. 2. Approximate distance measure taken with the omnidirectional camera as a person gets closer to the robot.
Two parameters α and β (both between 0 and 1) control the adaptation rate. Note that M, U and D are images; the pixel coordinates have been omitted for simplicity. For large values of α and β the model adaptation is slow: in that case, new background objects take longer to enter the model. For small values of α and β adaptation is faster, which can make animated objects enter the model. The method described up to this point still has a drawback. Inanimate objects should be considered background as soon as possible. However, as we are working at the pixel level, if we set the α and β parameters too low we run the risk of also considering static parts of animate objects as background. This problem can be alleviated by processing the image D. For each foreground blob, its values in D are examined: the maximum value is found, and all the blob values in D are set to that level. That is, if the foreground blobs at a given time step are the connected components of the thresholded difference image, each one made up of a certain number of pixels, then every pixel of a blob is assigned the largest difference value found inside that blob before the model is updated.
With this procedure the blob only enters the background model when all its pixels remain static. The blob does not enter the background model if at least one of its pixels has been changing.
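The exact updating function used by the authors is not reproduced here, so the sketch below only illustrates the idea: a per-pixel running average whose rate is governed by two coefficients playing the role of α and β, combined with the blob-level rule that propagates the maximum difference value over each connected component before the update. All thresholds and rates are placeholders.

```python
import numpy as np
from scipy.ndimage import label

def update_background(M, I, alpha=0.95, beta=0.8, thresh=25):
    """One update step of an adaptive background model (illustrative running average).
    M: current background model, I: new input frame, both (h, w) grayscale arrays."""
    D = np.abs(I.astype(float) - M)              # difference image
    fg = D > thresh                              # foreground mask
    blobs, n = label(fg)                         # connected components (foreground blobs)
    for b in range(1, n + 1):
        mask = blobs == b
        D[mask] = D[mask].max()                  # propagate the max difference over the blob
    # assumed rule: pixels that still differ adapt slowly (alpha), stable pixels faster (beta)
    rate = np.where(D > thresh, alpha, beta)
    M_new = rate * M + (1.0 - rate) * I
    return M_new, D
```

With this blob-level propagation, a blob is absorbed into the background only once all of its pixels have stopped changing, which matches the behaviour described above.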
3 Face Detection
Omnidirectional vision allows the robot to detect people in the scene, just to make the neck turn towards them (or somehow focus its attention). When the neck turns, there is no guarantee that omnidirectional vision has detected a person; it can be a coat stand, a wheelchair, etc. A face detection module should therefore be used to detect people (and possibly facial features). Facial detection commonly uses skin color as the most important feature. Color can be used to detect skin zones, though there is always the problem that some objects like furniture appear as skin, producing many false positives. Figure 3 shows how this problem affects detection in the ENCARA facial detector [5], which (besides other additional cues) uses normalized red and green color components for skin detection.
Fig. 3. Skin color detection. Note that wooden furniture is a distractor for facial detection. Both the bounding box and the best-fit ellipse are rather inaccurate (left).
In order to alleviate this problem, stereo information is very useful to discard objects that are far from the robot, i.e. in the background. Stereo cameras are nowadays becoming cheaper and faster. A depth map is computed from the pair of images taken by the stereo camera. For some cameras, the depth map is efficiently computed with an included optimized algorithm and library. The map is thresholded and an AND operation is performed between this map and the image that the facial detector uses. Fusion of color and depth was also used in [6,7,8]. The results are shown in Figure 4. Note that most of the undesired wood colored zones are filtered out.
Fig. 4. Skin color detection using depth information.
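A minimal sketch of the depth-based filtering step described above; the depth threshold is an arbitrary placeholder, and the skin map is assumed to come from any color-based detector.

```python
import numpy as np

def filter_skin_with_depth(skin_mask, depth_map, max_depth_mm=1500):
    """Keep only skin-colored pixels that are close enough to the camera.
    skin_mask: boolean (h, w) map from the color-based detector.
    depth_map: (h, w) depth values (e.g. millimetres) from the stereo camera."""
    near = depth_map < max_depth_mm          # thresholded depth map
    return np.logical_and(skin_mask, near)   # AND of the two maps
```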
4 Head Nod/Shake Detection
Due to the fact that practical (hands-free) voice recognition is very difficult to achieve for a robot, we decided to turn our attention to simpler (though useful) input techniques such as head gestures. Head nods and shakes are very simple in the sense that they only provide yes/no, understanding/disbelief, approval/disapproval meanings. However, their importance must not be underestimated for the following reasons: the meaning of head nods and shakes is almost universal, they can be detected in a relatively simple and robust way, and they can be used as the minimum feedback for learning new capabilities. The system for nod/shake detection described in [9] achieves a recognition accuracy of 78.46%, in real time. However, that system uses complex hardware and software: an infrared-sensitive camera synchronized with infrared LEDs is used to track the pupils, and an HMM-based pattern analyzer is used to detect nods and shakes. The system had problems with people wearing glasses, and could have problems with earrings too. The same pupil-detection technique was used in [10]. That work emphasized the importance of the timing and periodicity of head nods and shakes. However, in our view that information is not robust enough to be used. In natural human-human interaction, head nods and shakes are sometimes very subtle. We have no problem in recognizing them because the question has been clear, and only the YES/NO answers are possible. In many cases, there is no periodicity at all, only a slight head motion. Of course, the motion could be simply a 'Look up'/'Look down'/'Look left'/'Look right', though this is not likely after the question has been asked. For our purposes, the nod/shake detector should be as fast as possible. On the other hand, we assume that the nod/shake input will be used only after the robot has asked something. Thus, the detector can produce nod/shake detections at other times, as long as it outputs right decisions when they are needed. The major problem of observing the evolution of simple characteristics like the inter-eye position or the rectangle that fits the skin-color blob is noise. Due to the unavoidable noise, a horizontal motion (the NO) does not produce a pure horizontal displacement of the observed characteristic, because it is not being tracked. Even if it was tracked, it could drift due to lighting changes or other reasons. In practice, a horizontal motion produces a certain vertical displacement in the observed characteristic. This, given the fact that decision thresholds are set very low, can lead the system to error. The performance can be even worse if there is egomotion, like in our case (camera placed on a head with pan-tilt). The proposed algorithm uses the pyramidal Lucas-Kanade tracking algorithm described in [11]. In this case, there is tracking, and not of just one but of multiple characteristics, which increases the robustness of the system. The tracker first looks automatically for a number of good points to track; those points are accentuated corners. From the points chosen by the tracker we can attend to those falling inside the rectangle that fits the skin-color blob, observing their evolution. Note that even with the LK tracker there is noise in many of the tracking points; even in an apparently static scene there is a small motion in them. The procedure is shown in Algorithm 1. The method is shown working in Figure 5. The LK tracker allows indirect control of the number of tracking points.
The larger the number of tracking points, the more robust (and slower) the system. The method was tested giving a recognition rate of 100% (73 out
of 73 questions with alternate YES/NO responses, using the first response given by the system).
Fig. 5. Head nod/shake detector.
What happens if there are small camera displacements? In order to see the effect of this, linear camera displacements were simulated in the tests. In each frame, an error is added to the position of all the tracking points. If we denote by the horizontal and vertical components the average displacement of the points inside the skin-color rectangle, then the displacements actually observed are these components plus a random error, different for each frame and bounded in magnitude. Note that in principle it is not possible to use a fixed threshold because the error is unknown. The error also affects the tracking points that fall outside the rectangle. Assuming that the objects that fall outside the rectangle are static, we can estimate the error from those points, subtract it, and keep on using fixed thresholds for the horizontal and vertical displacements. For the system to work well, the face needs to occupy a large part of the image, so a zoom lens should be used. When a simulated error of a few pixels was introduced, the recognition rate was 95.9% (70 out of 73). In this case there is a slight error due to the fact that the
displacement components estimated outside the rectangle are not exactly zero even if the scene outside the rectangle is static. Another type of error that can appear when the camera is mounted on a mobile device like a pan-tilt unit is the inclination of the horizontal axis. In practice, this situation is common, especially with small inclinations. Inclinations can be a problem for deciding between a YES and a NO. In order to test this effect, an inclination error was simulated in the tests (with the correction of egomotion active). The error is a rotation of the displacement vectors D by a certain angle, clockwise. Recognition rates were measured for increasing values of this angle, producing useful rates for small inclinations: 90% (60 out of 66) and 83.8% (57 out of 68) for the two smallest angles tested, dropping to 9.5% (6 out of 63) for the largest one.
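Algorithm 1 itself is not reproduced in these pages, so the following OpenCV sketch only illustrates the decision rule described above: track corner points with the pyramidal LK tracker, average the displacements of the points inside the skin-color rectangle, subtract the average displacement of the outside points to compensate for egomotion, and compare the horizontal and vertical components against a threshold. The thresholds and tracker parameters are illustrative.

```python
import numpy as np
import cv2

def nod_shake_decision(prev_gray, gray, face_rect, thresh=1.0):
    """Return 'YES', 'NO' or None from one pair of frames (illustrative thresholds)."""
    pts = cv2.goodFeaturesToTrack(prev_gray, maxCorners=200, qualityLevel=0.01, minDistance=5)
    if pts is None:
        return None
    new_pts, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, gray, pts, None)
    ok = status.ravel() == 1
    d = (new_pts - pts).reshape(-1, 2)[ok]            # per-point displacements
    p = pts.reshape(-1, 2)[ok]
    x, y, w, h = face_rect
    inside = (p[:, 0] >= x) & (p[:, 0] < x + w) & (p[:, 1] >= y) & (p[:, 1] < y + h)
    if not inside.any() or inside.all():
        return None
    face_d = d[inside].mean(axis=0) - d[~inside].mean(axis=0)   # egomotion compensation
    dx, dy = abs(face_d[0]), abs(face_d[1])
    if max(dx, dy) < thresh:
        return None
    return 'NO' if dx > dy else 'YES'                 # horizontal motion = shake, vertical = nod
```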
5 Conclusions
Three simple but useful computer vision techniques have been described, suitable for human-robot interaction. First, an omnidirectional camera setting is described that can detect people in the surroundings of the robot, giving their angular positions and a rough estimate of the distance. The device can be easily built with inexpensive components. Second, we comment on a color-based face detection technique that can alleviate skin-color false positives. Third, a simple head nod and shake detector is described, suitable for detecting affirmative/negative, approval/disapproval, understanding/disbelief head gestures. The three techniques have been implemented and tested on a prototype social robot.
Acknowledgments. This work was partially funded by research projects PI2003/160 and PI2003/165 of Gobierno de Canarias and UNI2002/16, UNI2003/10 and UNI2003/06 of Universidad de Las Palmas de Gran Canaria.
References 1. Fong, T., Nourbakhsh, I., Dautenhahn, K.: A survey of socially interactive robots. Robotics and Autonomous Systems 42 (2003) 2. Liu, X., Zhao, Y., Pi, X., Liang, L., Nefian, A.: Audio-visual continuous speech recognition using a coupled Hidden Markov Model. In: IEEE Int. Conference on Spoken Language Processing. (2002) 213–216 3. McCowan, I.: Robust speech recognition using microphone arrays. PhD thesis, Queensland University of Technology, Australia (2001) 4. Deniz, O.,Castrillon,M.,Lorenzo, J., Guerra,C,Hernandez, D., Hernandez, M.: CASIMIRO: A robot head for human-computer interaction. In: Proceedings of 11th IEEE International Workshop on Robot and Human Interactive Communication (ROMAN’2002). (2002) Berlin, Germany. 5. Castrillon, M.: On Real-Time Face Detection in Video Streams. An Opportunistic Approach. PhD thesis, Universidad de Las Palmas de Gran Canaria (2003) 6. Darrell, T., Gordon, G., Harville, M., Woodfill, J.: Integrated person tracking using stereo, color, and pattern detection. In: Procs. of IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Santa Barbara, CA (1998) 601–608
7. Moreno, F., Andrade-Cetto, J., Sanfeliu, A.: Localization of human faces fusing color segmentation and depth from stereo (2001) 8. Grange, S., Casanova, E., Fong, T., Baur, C.: Vision based sensor fusion for human-computer interaction (2002) 9. Kapoor, A., Picard, R.: A real-time head nod and shake detector (2001) 10. Davis, J., Vaks, S.: A perceptual user interface for recognizing head gesture acknowledgements. In: Proc. of ACM Workshop on Perceptual User Interfaces, Orlando, Florida (2001) 11. Bouguet, J.: Pyramidal implementation of the Lucas Kanade feature tracker. Technical report, Intel Corporation, Microprocessor Research Labs, OpenCV documents (1999)
Face Recognition with Generalized Entropy Measurements

Yang Li and Edwin R. Hancock
Department of Computer Science, University of York, York, UK YO10 5DD
Abstract. This paper describes the use of shape-from-shading for face recognition. We apply shape-from-shading to tightly cropped face images to extract fields of surface normals or needle maps. From the surface normal information, we make estimates of curvature attributes. The quantities studied include minimum and maximum curvature, mean and Gaussian curvature, and, curvedness and shape index. These curvature attributes are encoded as histograms. We perform recognition by comparing the histogram bin contents using a number of distance and similarity measures including the Euclidean distance, the Shannon entropy, the Renyi entropy and the Tsallis entropy. We compare the results obtained using the different curvature attributes and the different entropy measurements.
1 Introduction
The aim in this paper is to investigate whether the curvature information delivered by shape-from-shading can be used for histogram-based face recognition. Histograms have proved to be simple and powerful attribute summaries which are very effective in the recognition of objects from large image data-bases. The idea was originally popularized by Swain and Ballard who used color histograms [11]. There have since been several developments of the idea. For instance Gimelfarb and Jain [2] have used texture histograms for 2D object recognition, Dorai and Jain [1] have used shape index histograms for range image recognition and relational histograms have also been used for line pattern recognition [4]. Here we explore whether curvature attributes extracted from the surface normals obtained using shape-from-shading can be used for the purpose of histogram-based face recognition. The attributes explored are maximum and minimum curvature, mean and Gaussian curvature, and, curvedness and shape index. We compute distances between histograms using a number of entropybased measurements. These include the Shannon entropy [10], the Renyi entropy [8] and the Tsallis entropy [7]. We present a quantitative study of recognition performance using precision-recall curves for the different curvature attributes and the different entropy measurements. A. Campilho, M. Kamel (Eds.): ICIAR 2004, LNCS 3212, pp. 733–740, 2004. © Springer-Verlag Berlin Heidelberg 2004
2 Curvatures from Shape-from-Shading
Curvature Calculation: The aim in this paper is to investigate whether histograms of curvature attributes can be used for face recognition under different conditions. Gordon [3] has listed three potential benefits that curvature features offer over intensity-based features for face recognition. Specifically, curvature features 1) have the potential for higher accuracy in describing surface-based events, 2) are better suited to describe properties of the face in areas such as the cheeks, forehead, and chin, and 3) are viewpoint invariant. Although there are several alternatives, Gordon [3] found mean and Gaussian curvature to be effective in the recognition of faces from range imagery. The aim in this paper is to investigate whether similar ideas can be applied to the surface normals extracted from 2D intensity images of faces using shape-from-shading. This extends earlier work where we have shown that the surface normals delivered by shape-from-shading can be used for 3D face pose estimation from 2D images [5]. In this section we describe how to compute a variety of curvature attributes from the fields of surface normals or needle maps delivered by shape-from-shading. The shape-from-shading algorithm we employ here is the one proposed by Worthington and Hancock [14]. It has been proved to deliver needle maps which preserve fine surface detail. The starting point for our curvature analysis of the fields of surface normals delivered by shape-from-shading is the Hessian matrix [13]. The differential structure of a surface is captured by the Hessian matrix (or second fundamental form), which may be written in terms of the derivatives of the surface normals:

H = [ ∂p/∂x  ∂p/∂y ; ∂q/∂x  ∂q/∂y ],

where p and q are the gradients of the surface in the x and y directions. For smooth surfaces, H is symmetric and the surface is hence integrable. A number of curvature attributes can be computed from the Hessian matrix (a computational sketch is given after this list). These include:
- The principal curvatures κ1 and κ2 (with κ1 ≥ κ2), which can be obtained by computing the eigenvalues of the Hessian matrix H. In terms of the row and column elements of H, κ{1,2} = (H11 + H22)/2 ± sqrt(((H11 − H22)/2)² + H12 H21).
- The maximum curvature κmax = κ1 and the minimum curvature κmin = κ2.
- The mean and Gaussian curvatures, M = (κ1 + κ2)/2 and K = κ1 κ2.
- The shape index and curvedness [6], φ = (2/π) arctan((κ2 + κ1)/(κ2 − κ1)) and c = sqrt((κ1² + κ2²)/2).
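As an illustration, the sketch below derives these attributes from a needle map using finite differences; the conversion from normals to surface gradients and the symmetrization of H are our assumptions, not Worthington and Hancock's exact implementation.

```python
import numpy as np

def curvature_attributes(normals):
    """normals: (h, w, 3) needle map of unit surface normals.
    Returns per-pixel principal, mean and Gaussian curvatures, shape index, curvedness."""
    nz = np.clip(normals[..., 2], 1e-6, None)
    p = -normals[..., 0] / nz                      # surface gradient in x (assumed convention)
    q = -normals[..., 1] / nz                      # surface gradient in y
    dp_dy, dp_dx = np.gradient(p)                  # np.gradient: axis 0 = rows (y), axis 1 = columns (x)
    dq_dy, dq_dx = np.gradient(q)
    off = 0.5 * (dp_dy + dq_dx)                    # symmetrized off-diagonal element of H
    half_trace = 0.5 * (dp_dx + dq_dy)
    root = np.sqrt((0.5 * (dp_dx - dq_dy)) ** 2 + off ** 2)
    k1, k2 = half_trace + root, half_trace - root  # principal curvatures, k1 >= k2
    mean, gauss = 0.5 * (k1 + k2), k1 * k2
    shape_index = (2.0 / np.pi) * np.arctan((k1 + k2) / (k2 - k1 - 1e-12))
    curvedness = np.sqrt(0.5 * (k1 ** 2 + k2 ** 2))
    return k1, k2, mean, gauss, shape_index, curvedness
```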
Histogram Construction: From the curvature attributes, we construct 1D and 2D histograms. The histograms used in our study are:
1. 2D maximum and minimum curvature histograms;
2. 2D mean and Gaussian curvature histograms;
3. 2D shape index and curvedness histograms;
4. 1D shape index histograms;
5. 1D curvedness histograms.
The 1D histograms have 100 uniformly spaced bins, while the 2D histograms have 50 × 50 bins. These histograms are normalized so that the total bin-contents sums to unity.
3 Generalized Entropy Measurements
Once the histograms are normalized, we can use their bin contents p_i as a discretely sampled estimate of the curvature attribute probability density [7]. In our approach, the dissimilarity between histograms is measured by either the Euclidean distance or a generalized entropy. There are three different classes of generalized entropy that can be used for this purpose.
Boltzmann-Shannon Entropy: The Boltzmann-Shannon entropy was first proposed as an information measure in communication systems [10]. It is defined as

H = − Σ_i p_i log p_i.

Renyi Entropy: The Renyi entropy [8] is one of the canonical generalized entropies. It is defined as

H_α = (1 / (1 − α)) log Σ_i p_i^α,

where α is a variable called the information order and p_i is the probability density of the random variable.
Tsallis Entropy: The Tsallis entropy [7] is defined as

S_q = (1 / (q − 1)) (1 − Σ_i p_i^q).
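A minimal sketch of the three entropy measures applied to a normalized histogram, together with the Euclidean distance; how exactly the entropies are turned into a pairwise dissimilarity between two histograms is not fully recoverable from the text above, so only the measures themselves are shown.

```python
import numpy as np

def shannon_entropy(p, eps=1e-12):
    p = np.asarray(p, dtype=float)
    return -np.sum(p * np.log(p + eps))            # Boltzmann-Shannon entropy

def renyi_entropy(p, alpha=2.0, eps=1e-12):
    p = np.asarray(p, dtype=float)
    return np.log(np.sum(p ** alpha) + eps) / (1.0 - alpha)

def tsallis_entropy(p, q=2.0):
    p = np.asarray(p, dtype=float)
    return (1.0 - np.sum(p ** q)) / (q - 1.0)

def euclidean_distance(p, r):
    return np.linalg.norm(np.asarray(p, dtype=float) - np.asarray(r, dtype=float))
```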
4 Experimental Results
For our experimental investigation of face recognition using curvature attributes delivered by shape-from-shading we use two data-bases:
The first one is the face data-base from the AT&T Research Laboratory in Cambridge [9]. It contains 400 face images, which belong to 40 individuals with 10 images for each subject. Each image is a 92 × 112 PGM file. Here the faces are all viewed in a frontal pose, although in some cases there is a small degree of tilt and slant. The second one is a semi-synthetic face data-base from the Max Planck Institute. It contains 120 face images, which belong to 40 individuals with 3 images for each subject. The faces were laser-scanned to give height maps. Face images were obtained by rendering the height maps with a single point light source positioned at infinity with the same direction as the viewer. Each image is a 128 × 128 PGM file. These images include front-viewed images, and also images with +30 and -30 degree rotations. Examples of groups of faces from these two data-bases are given in Figures 1 and 2.
Fig. 1. An example of faces of the same subject of the AT&T face data-base
Fig. 2. An example of faces of the same subject of the synthetic face data-base
We extract curvature attributes from both real and synthetic face images and then construct histograms of the various curvature attributes as described in Sect. 2.
With the histograms in hand, we measure the dissimilarity of face images using the Euclidean distance or the generalized entropy measurements between the corresponding histograms. Here we use precision-recall curves to evaluate the performance. In total there are 16 precision-recall curves, corresponding to 4 curvature attributes (maximum and minimum curvature, mean and Gaussian curvature, curvedness and shape index, and shape index only) × 4 measurements (Euclidean distance, Shannon entropy, Renyi entropy and Tsallis entropy). The following conclusions can be drawn from the precision-recall curves in Fig. 3. Curvature performance: among the various curvatures, we found that the histogram of curvedness and shape index gives the best performance. It is interesting to note that the two individual 1D histograms of shape index and curvedness give a performance which is lower than the combined histogram. In fact, of the two 1D histograms it is the curvedness, i.e. the overall magnitude of curvature, which gives the best performance, while the shape index (i.e. the measure of surface topography) gives much poorer performance. This suggests that there may be further scope for investigating how to optimally utilise the two measures for the purposes of recognition. Dissimilarity measurement performance: of the various dissimilarity measurements, we found that the Renyi entropy and the Tsallis entropy give the same performance. This is the case for each curvature attribute and for both data-bases. At the same time, the Shannon entropy measurement gives the best performance. In the experiment based on the AT&T data-base, all the generalized entropy measurements give overall better performance than the Euclidean distance. Tables 1, 2, 3 and 4 show the precision values for the first retrieved image for the various curvature attributes and dissimilarity measurements. The recall values are not shown since they always keep a constant ratio to the precision values across the different attributes and measurements. The maximum precision rate (the precision rate of the first image) we obtained is 74.50%, while Zhao and Chellappa achieved very high precision rates in their similar work [15]. However, since their implementation requires the creation of a prototype image and a series of dimension reductions, we can reasonably suppose that a large amount of computation time will be spent on these steps. Our approach can act as a preliminary step which helps improve the performance of further steps in an integrated system.
5 Conclusion
In this paper, we have investigated the use of curvature attributes derived from shape-from-shading for face recognition. We have constructed histograms from maximum and minimum curvature, mean and Gaussian curvature, and curvedness and shape index, and we have investigated the use of a number of generalized entropy measurements.
Fig. 3. Precision-recall curves of histograms based on curvature attributes and dissimilarity measurements. The subjects are face images from the AT&T data-base and the synthetic one. (a)The upper left one is the precision-recall curves of curvature attributes of the AT&T data-base; (b)The upper right one is the precision-recall curves of dissimilarity measurements of the AT&T data-base; (c)The lower left one is the precision-recall curves of curvature attributes of the synthetic data-base; (d)The lower right one is the precision-recall curves of dissimilarity measurements of the synthetic data-base.
From the precision-recall curves for these different attributes and entropy measurements, the main conclusions of our empirical study are as follows. The generalized entropy measurements apparently improve the precision-recall rate of the face recognition because they magnify the dissimilarity between histograms.
With precision-recall curves, we find that the best performance is delivered when the Shannon entropy is applied to histograms of curvedness and shape index among all the possible combinations of dissimilarity measurements and curvature histograms.
References 1. C. Dorai and A. K. Jain. Cosmos - a representation scheme for free form surfaces. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 19(10):1115–1130, October 1997. 2. G. L. Gimel’farb and A. K. Jain. On retrieving textured images from an image database. Pattern Recognition, 29(9):1461–1483, 1996. 3. G. Gordon. Face recognition based on depth and curvature features. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 108–110, Champaign, Illinois, USA, June 1992. 4. B. Huet and E. R. Hancock. Line pattern retrieval using relational histograms. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 21(12):1363–1370, December 1999. 5. B. Huet and E. R. Hancock. Relational object recognition from large structural libraries. Pattern Recognition, 35(9):1895–1915, September 2002. 6. J. J. Koenderink and A. J. van Doorn. Surface shape and curvature scales. Image and Vision Computing, 10(8):557–565, October 1992. 7. A. Nobuhide and T. Masaru. Information theoretic learning with maximizing tsallis entropy. In Proceedings of International Technical Conference on Circuits/System, Computers and Communications, ITC-CSCC, pages 810–813, Phuket, Thailand, 2002. 8. A. Renyi. Some fundamental questions of information theory. Turan [12] (Originally: MTA III. Oszt. Kozl., 10, 1960, pp. 251-282), pages 526–552, 1960. 9. F. S. Samaria and A. C. Harter. Parameterisation of a stochastic model for human face identification. In IEEE Workshop on Applications of Computer Vision, Sarasota, Florida, USA, December 1994. 10. C. E. Shannon. A mathematical theory of communication. Bell System Technical Journal, 27:379–423, 623–653, July and October 1948. 11. M. J. Swain and D. H. Ballard. Indexing via color histogram. In Proceedings of third international conference on Computer Vision (ICCV), pages 390–393, Osaka, Japan, 1990. 12. P. E. Turan. Selected Papers of Alfred Renyi. Akademiai Kiado, Budapest, Hungary, 1976. 13. R. J. Woodham. Gradient and curvature from the photometric stereo method, including local confidence estimation. Journal of the Optical Society of America, 11(11):3050–3068, November 1994. 14. P. L. Worthington and E. R. Hancock. New constraints on data-closeness and needle map consistency for shape-from-shaping. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 21(12):1250–1267, December 1999. 15. W. Zhao and R. Chellappa. Robust face recognition using symmetric shape-fromshading, 1999.
Facial Feature Extraction and Principal Component Analysis for Face Detection in Color Images

Saman Cooray¹ and Noel O'Connor²
¹ School of Electronic Engineering, Dublin City University, Ireland
² Centre for Digital Video Processing, Dublin City University, Ireland
{coorays, oconnorn}@eeng.dcu.ie
Abstract. A hybrid technique based on facial feature extraction and Principal Component Analysis (PCA) is presented for frontal face detection in color images. Facial features such as eyes and mouth are automatically detected based on properties of the associated image regions, which are extracted by RSST color segmentation. While mouth feature points are identified using the redness property of regions, a simple search strategy relative to the position of the mouth is carried out to identify eye feature points from a set of regions. Priority is given to regions which signal high intensity variance, thereby allowing the most probable eye regions to be selected. On detecting a mouth and two eyes, a face verification step based on Eigenface theory is applied to a normalized search space in the image relative to the distance between the eye feature points. Keywords: face detection, facial feature extraction, PCA, color segmentation, skin detection
1 Introduction Face detection is an important task in facial analysis systems, whose aim is to localize faces a priori in a given image. Applications such as face tracking, facial expression recognition, and gesture recognition, for example, have the pre-requisite that a face is already located in the given image or image sequence. Numerous face detection techniques have been proposed in the literature to address the challenging issues associated with this problem. These techniques generally fall under four main categories of approach: knowledge-based, feature invariant, template matching, and appearance-based [1]. Some algorithms rely solely on low-level image properties such as color and image contours, from which image blobs are detected and compared with pre-defined shapes (e.g., an elliptical shape) [1][2]. Combining facial features detected inside the skin-color blobs helps to extend this type of approach towards more robust face detection algorithms [3][4]. Facial features derived from gray-scale images along with some classification models have also been used to address this problem [5]. Menser and Muller presented a method for face detection by applying PCA on skin-tone regions [6]. Using appearance-based properties for classification in more efficient ways, upright frontal face detection in gray-scale
images through neural networks has proven to be a promising solution to this problem [7]. Chengjun Liu proposed a face detection technique based on discriminating feature analysis, statistical modeling of face and non-face classes, and a Bayes classifier to detect frontal faces in gray-scale images [8]. In this paper, we present a hybrid approach for frontal face detection in color images based on facial feature extraction and the use of appearance-based properties of face images. It follows the face detection algorithm proposed by Menser and Muller, which localized the computation of PCA to skin-tone regions. Our approach begins with a facial feature extraction algorithm, which illustrates how image regions segmented using chrominance properties can be used to detect facial features, such as the eyes and mouth, based on their statistical, structural, and geometrical relationships in frontal face images. The face detection task is then performed by applying PCA-based statistical analysis to a smaller, normalized search space of the image. The block diagram of the proposed system is shown in Fig. 1. The first task in this system is skin detection, which is carried out using a statistical skin detection model built by acquiring a large training set of skin and non-skin pixels. A skin map is generated through direct reference to the pre-computed probability map in the RGB space and a simple threshold criterion. A face-bounding box is then obtained from the skin map, to which the RSST segmentation algorithm is applied to create a segmentation partition of homogeneous regions. Possible mouth features are first identified based on the redness property of image pixels and the corresponding RSST regions. Eye features are then identified relative to the position of the mouth, by searching for regions that satisfy some statistical, geometrical, and structural properties of the eyes in frontal face images. On detecting a feature set containing a mouth and two eyes, PCA analysis is performed over a normalized search space relative to the distance between the two eyes. The image location corresponding to the minimum error is then considered the position of the detected face.
Fig. 1. Face detection system
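As an illustration of the skin detection step described above, the following sketch thresholds a pre-computed RGB skin-probability map; the lookup-table resolution, its layout, and the threshold value are illustrative assumptions and are not taken from the paper.

    import numpy as np

    def skin_map(image_rgb, skin_prob_lut, threshold=0.4):
        # image_rgb:     H x W x 3 uint8 array.
        # skin_prob_lut: 32 x 32 x 32 array of P(skin | r, g, b) built offline
        #                from a large training set of skin and non-skin pixels
        #                (the 32-bin quantization and the 0.4 threshold are
        #                assumptions, not values from the paper).
        idx = image_rgb // 8              # quantize each 8-bit channel into 32 bins
        probs = skin_prob_lut[idx[..., 0], idx[..., 1], idx[..., 2]]
        return probs > threshold          # binary skin map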
The paper is organized as follows. In section 2, a region-based facial feature extraction technique is described. Starting with a brief description of the RSST segmentation algorithm in section 2.1, the mouth detection and eye detection tasks are described in sections 2.2 and 2.3 respectively. The combined approach to face detection using facial features and PCA analysis is described in section 3. Experimental results are presented in section 4. Some conclusions and future work are then given in section 5.
2 Facial Feature Extraction A general view of a frontal face image containing a mouth and two eyes is shown in Fig. 2. The left and right eye feature points are marked in the figure, while M represents the mouth feature. The distance between the two eyes is w, and the distance from the mouth to the eyes is h. In frontal face images, structural relationships, such as the Euclidean distances between the mouth and the left and right eyes and the angle between the eyes and the mouth, provide useful information about the appearance of a face. These structural relationships of the facial features are generally useful for constraining the facial feature detection process. A search area represented by a square of size 3w x 3w is also an important consideration when searching for faces based on the detected eye feature positions in the image.
Fig. 2. A frontal face view
Fig. 3. RSST color segmentation (a) face-bounding box (b) segmentation based on luminance and chrominance merging (c) segmentation based on chrominance merging
2.1 Recursive Shortest Spanning Tree (RSST) Color Segmentation Algorithm The process of region merging in the conventional RSST algorithm is defined by a merging distance that involves both the luminance and chrominance properties of the color space. However, we found in our experiments that the eye detection task can be performed using only the chrominance components in the merging distance. The conventional RSST merging distance and the modified merging distance are defined by equations (1) and (2) respectively, where d(R1, R2) is the merging distance between two regions R1 and R2, and Y(R) and N(R) denote the mean luminance and the spatial size of a region R (the corresponding mean chrominance values are defined analogously). The two segmentations shown in Fig. 3b and Fig. 3c are obtained from these two distance measures when RSST is performed on the face-bounding box shown in Fig. 3a. This highlights the fact that distinct eye regions can be obtained more accurately from the chrominance-based merging.
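Since equations (1) and (2) are not reproduced above, the following sketch only illustrates the idea of chrominance-only region merging; the size-weighted squared colour difference is a common form of the RSST merging cost, and the field names are assumptions rather than the paper's exact definitions.

    def rsst_merge_cost(r1, r2, use_luminance=True):
        # r1, r2: regions represented as dicts with mean channel values
        #         'Y', 'Cb', 'Cr' and pixel count 'N' (assumed representation).
        d2 = (r1['Cb'] - r2['Cb']) ** 2 + (r1['Cr'] - r2['Cr']) ** 2
        if use_luminance:                 # conventional luminance + chrominance merging
            d2 += (r1['Y'] - r2['Y']) ** 2
        # weight the colour difference by the region sizes, as in common RSST costs
        return (r1['N'] * r2['N'] / float(r1['N'] + r2['N'])) * d2

With use_luminance=False the cost depends only on the chrominance means, which corresponds to the modified merging used for eye detection.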
2.2 Mouth Detection The mouth detection task is performed based on the redness property of the lips. After extracting a face-bounding box from the skin detection process, the red lip pixels are detected using the criterion defined in equation (3), in which N represents the spatial size of the face-bounding box, and are represented as a mouth map [4]. Regions in the segmentation that correspond to the detected mouth map are first identified. If multiple such regions are present, they are merged based on their proximity and represented as a single mouth map. The center of gravity of the combined regions is then taken as the mouth feature position.
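Equation (3) itself is not reproduced above; the sketch below follows the mouth map of Hsu et al. [4], which the text cites for this criterion, applied to the Cr and Cb chrominance channels of the face-bounding box. Whether the paper uses exactly these constants is an assumption.

    import numpy as np

    def hsu_style_mouth_map(cr, cb):
        # cr, cb: float arrays of the Cr and Cb channels of the face-bounding box,
        #         assumed here to be normalized to [0, 1].
        cb = np.clip(cb, 1e-3, None)            # avoid division by zero
        n = cr.size                             # spatial size of the face-bounding box
        eta = 0.95 * (np.sum(cr ** 2) / n) / (np.sum(cr / cb) / n)
        return (cr ** 2) * (cr ** 2 - eta * (cr / cb)) ** 2

High values of this map indicate reddish (lip-like) pixels; the RSST regions overlapping them are then merged into the mouth map described above.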
2.3 Eye Detection An important statistical property of eye image regions is that they exhibit high intensity variance, because human eyes generally contain both black (near-black) and white (near-white) areas. Such regions can be identified by computing their variances using equation (4),

variance = (1/N) * sum over region pixels of (Y - Ybar)^2     (4)

where Y represents the intensity value of each pixel in the region, Ybar is the mean intensity of the region, and N is the spatial size of the region. This principle is illustrated in the graph shown in Fig. 4, which shows the distribution of the variances of the different regions of the segmentation given in Fig. 3c against their 2D positions in the original image; only a few regions show high variance.
However, in practice, a variance measure alone will not be sufficient to find eye regions in a given image, although it can provide some useful clues. This fact leads us to constrain the eye search process by relating it with geometrical and structural properties of eye regions in frontal face images. Hence, the following heuristic rules
are applied in the eye detection process (geometrical parameters are given with reference to Fig. 2):
- The eye region should be at least 10 pixels above the mouth level, i.e. h >= 10 pixels.
- The width/height ratio of an eye region should be at least 0.4.
- The distance from the mouth to the left and right eyes should be within a pre-defined range.
- The angle between the mouth and the eyes should be within a pre-defined range.
- The eye region should correspond to a dark blob in the image.
Fig. 4. Variance distribution of regions corresponding to a face segmentation containing 61 image regions
Fig. 5. Gradient orientation of pixels for 5x5 masks
While the first four conditions are simple heuristic rules and require no description, the fifth condition uses the dark blobs (corresponding to pupils) present in human eyes. A dark/bright blob detection technique, proposed specifically for facial feature detection by Lin and Lin [9], is used for identifying dark blobs (gradients pointing away from dark towards bright) in this system. It computes the radial symmetry about a center pixel by considering the gradient distributions of the local neighbors within a pre-defined mask. Fig. 5 shows the possible orientations of pixel gradients with respect to a 5x5 mask. Lin and Lin pointed out that the algorithm produces quite dense radially symmetric feature points at this stage, most of them corresponding to spurious or non-facial feature points; thus, two inhibitory mechanisms are used to suppress the spurious points: regions of uniform gradient distribution should not be considered, and a line sketch of the input image should be used to eliminate areas that are not covered by sketch lines. However, we can avoid these two additional inhibitory conditions by applying the algorithm to the chrominance image rather than to the luminance image.
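A minimal sketch of the eye-candidate filtering, combining the variance of Eq. (4) with the first heuristic rules; the distance range is a placeholder because the paper's pre-defined ranges were not preserved, and the pair-wise angle rule and the dark-blob test are not shown.

    import math
    import numpy as np

    def region_variance(intensities):
        # Intensity variance of a region, as in Eq. (4).
        y = np.asarray(intensities, dtype=float)
        return np.mean((y - y.mean()) ** 2)

    def passes_single_region_rules(region, mouth_xy, dist_range=(20, 120)):
        # region:   dict with 'centroid' (x, y), 'width', 'height' (assumed representation).
        # mouth_xy: (x, y) position of the detected mouth feature.
        cx, cy = region['centroid']
        mx, my = mouth_xy
        if my - cy < 10:                                    # at least 10 pixels above the mouth
            return False
        if region['width'] / float(region['height']) < 0.4:
            return False
        dist = math.hypot(cx - mx, cy - my)                 # mouth-to-eye distance
        return dist_range[0] <= dist <= dist_range[1]       # placeholder range

Candidates passing these rules are ranked by region_variance, so the most probable eye regions (those with the highest intensity variance) are selected first.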
3 Combining Facial Features with Appearance-Based Face Detection A difficulty of using PCA as a face classification step is the lack of a properly defined error criterion on face images. As a result, inaccurate face regions can produce a smaller error, so the image block with the minimum error may converge to a wrong face location. However, when the search space is reduced to a smaller size, the intended result can be achieved in most cases. This phenomenon is illustrated in Fig. 6.
Fig. 6. Detected faces using PCA alone
Because PCA is a sub-optimal solution for face classification, we perform face detection based on the use of both facial features and PCA. The objective of using facial features in this system is to localize the image area on which the PCA analysis is performed. In this context, the detected eye feature points are used to define a normalized search space. A search area is first defined according to the eye-eye distance; it is of size 3w x 3w (see Fig. 2). A scale factor is then calculated from the positions of the detected eye feature points and the eye-eye distance of the pre-defined face model (8 pixels). Hence, a normalized image search space is obtained, on which PCA is performed using 16x16 image blocks. It should be noted that the use of a normalized search space in this system avoids the requirement of analyzing the image at different resolution levels to locate faces of different scales. The principal idea of detecting faces in images using Eigenfaces is based on how far a reconstructed image is from the face space. This distance, known as the Distance from Face Space (DFFS), is defined by equation (5) [6][10].
DFFS(x) = ||x - xbar||^2 - sum over i = 1..M of y_i^2     (5)

where x is the current image vector, xbar is the mean face vector, and y_1, ..., y_M are the M principal components corresponding to the M eigenvectors.
Menser and Muller also used a modified error criterion (e) by incorporating another distance measure called “Distance in Face Space (DIFS)”, which is defined by equation (6). The modified error criterion is defined by equation (7) [8].
where c is a constant defined in terms of the smallest computed eigenvalue and an empirically chosen constant k; these two parameters determine the value of c. In our system, a total of 500 face images were used as the set of training images, taken from the ECU database [11]. The set of 500 training images was obtained by doubling the original number of images with their mirror images. The original 250-image training set is a collection of 100 frontal upright images, 50 frontal images with glasses, and 100 slightly rotated face images selectively chosen from the second set of face patterns of the ECU database.
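The face verification error of equations (5)-(7) can be sketched as follows; the DFFS term follows Eq. (5), the DIFS term uses the standard eigenface form, and the weighting constant c is left as an input because its exact definition (from the smallest eigenvalue and the constant k) is not reproduced above.

    import numpy as np

    def eigenface_error(block, mean_face, eigvecs, eigvals, c):
        # block:     16 x 16 candidate block from the normalized search space.
        # mean_face: mean face vector of the training set.
        # eigvecs:   (D, M) matrix whose columns are the M leading eigenvectors.
        # eigvals:   (M,) corresponding eigenvalues.
        # c:         weighting constant of the modified criterion (Eq. (7)).
        phi = block.astype(float).ravel() - mean_face.ravel()
        y = eigvecs.T @ phi                    # principal components y_1 .. y_M
        dffs = phi @ phi - y @ y               # distance from face space, Eq. (5)
        difs = np.sum(y ** 2 / eigvals)        # distance in face space (standard form)
        return dffs + c * difs

The candidate position with the minimum error over the normalized search space is taken as the detected face.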
4 Experimental Results and Analysis Fig. 7 shows some examples of the detected facial features superimposed on the original images. Fig. 7a is a bright frontal face image, whereas Fig. 7b is a frontal face image with glasses causing bright reflections. A half-frontal face image is shown in Fig. 7c, and a randomly selected image containing two faces is shown in Fig. 7d. Accurate results are obtained for the first, third, and fourth images, while slightly inaccurate eye feature points are detected in the second case; the bright reflections caused by the glasses in the second image led to errors in the eye detection process.
Fig. 7. Facial feature extraction results
Face detection results are given in Fig. 8. Our experiments were carried out on test images taken from the HHI face database and various other randomly selected sources. We noted that the slight inaccuracies that occurred in the facial feature extraction process did not affect the performance of face detection. A performance analysis of the face verification step, given accurately detected eye features for 125 single-face images, is presented in Table 1. This analysis is presented in terms of correct
detections, false rejections, and false detections for three cases of the error term e: the minimum error of e is considered in the first case, while two threshold values, T1 and T2, are used in the other two cases.
Fig. 8. Face detection results
5 Conclusion and Future Work A hybrid solution to frontal face detection using facial features and Eigenfaces theory is presented. Using a facial feature extraction step prior to performing PCA analysis helps to address two requirements for this system. Firstly, the search for faces does not need to be carried out at every pixel location in the image since a small search space can be obtained using the detected facial feature points. Secondly, the face detection process can be carried out in one cycle over a normalized search space, thereby avoiding the requirement of processing the image at multiple scales. However, due to the fact that PCA is a sub-optimal solution to face classification,
detection of some real-world images can be difficult, resulting in performance degradation in the system. For this reason, we believe that the performance of this system can be improved by extending the face classification step towards a two-class classification problem with the use of a carefully chosen set of non-faces as the second class. Acknowledgment. This research is carried out as part of the MobiSyM project funded by Enterprise Ireland. Their support is gratefully acknowledged.
References
1. M.-H. Yang, D. J. Kriegman, and N. Ahuja, "Detecting Faces in Images: A Survey", IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 1, pp. 34-58, January 2002.
2. A. Eleftheriadis and A. Jacquin, "Automatic Face Location, Detection and Tracking for Model-Assisted Coding of Video Teleconferencing Sequences at Low Bit Rates", Signal Processing: Image Communication, vol. 7, no. 3, pp. 231-248, July 1995.
3. K. Sobottka and I. Pitas, "A Novel Method for Automatic Face Segmentation, Facial Feature Extraction and Tracking", Signal Processing: Image Communication, vol. 12, no. 3, pp. 263-281, 1998.
4. R.-L. Hsu, M. Abdel-Mottaleb, and A. K. Jain, "Face Detection in Color Images", IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 5, pp. 696-706, May 2002.
5. K. C. Yow and R. Cipolla, "Feature-Based Human Face Detection", Image and Vision Computing, vol. 15, pp. 713-735, 1997.
6. B. Menser and F. Muller, "Face Detection in Color Images using Principal Component Analysis", IEE Conference Publication, vol. 2, no. 465, pp. 620-624, 1999.
7. H. A. Rowley, S. Baluja, and T. Kanade, "Neural Network-Based Face Detection", IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 1, pp. 23-38, January 1998.
8. C. Liu, "A Bayesian Discriminating Features Method for Face Detection", IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 25, no. 6, pp. 725-740, June 2003.
9. C.-C. Lin and W.-C. Lin, "Extracting Facial Features by an Inhibitory Mechanism based on Gradient Distributions", Pattern Recognition, vol. 29, no. 12, 1996.
10. M. Turk and A. Pentland, "Eigenfaces for Recognition", Journal of Cognitive Neuroscience, vol. 3, no. 1, pp. 71-86, 1991.
11. S. L. Phung, D. Chai, and A. Bouzerdoum, "Adaptive Skin Segmentation in Color Images", International Conference on Acoustics, Speech and Signal Processing (ICASSP), vol. 3, pp. 353-356, April 2003.
Fingerprint Enhancement Using Circular Gabor Filter
En Zhu, Jianping Yin, and Guomin Zhang
School of Computer Science, National University of Defense Technology, Changsha 410073, China
[email protected]; [email protected]
Abstract. Fingerprint minutiae are prevalently used in AFISs. The extraction of fingerprint minutiae is heavily affected by the quality of fingerprint images. This leads to the incorporation of a fingerprint enhancement module in an AFIS to make the system robust with respect to the quality of input fingerprint images. Most existing enhancement methods suffer from two main kinds of defects: (1) they are time-consuming and thus unusable in time-critical applications; and (2) they introduce blocky effects in the enhanced image. This paper follows Hong's Gabor filter based enhancement scheme (IEEE Trans. PAMI, vol. 20, no. 8, pp. 777-789, 1998) but uses a circular-support filter and tunes the filter's frequency and size differently. This scheme enhances the fingerprint image rapidly, effectively overcomes the blocky effect, and improves the performance of minutiae detection.
1 Introduction Fingerprint recognition is the most popular method in biometric authentication at present. Minutiae, typically ridge terminations and bifurcations, are the characteristic features of fingerprints that determine their uniqueness. In an AFIS, valid minutiae can be hidden and spurious minutiae can be produced due to the low quality of the fingerprint image, so fingerprint enhancement is often required to improve the quality of the fingerprint image. Several fingerprint enhancement schemes have been proposed [1-7], and these works successfully remove noise. Most of them [2-7] are highly dependent on the orientation of the ridges, and some [5-7] perform local estimates in a highly discretized manner, resulting in blocky effects. [1] proposed a decomposition method to estimate the orientation field from a set of filtered images obtained by applying a bank of Gabor filters to the input image; this method is computationally expensive and unusable in some time-critical applications. Later, Hong [4] proposed a fast enhancement algorithm, which we call the HWJ algorithm, that segments an input fingerprint image into non-overlapping blocks and adaptively enhances each block using both the local ridge orientation and the local frequency information. However, this method produces blocky effects in some enhanced images: (1) neighboring blocks may not join perfectly, as shown in Fig. 3(b), because each block is enhanced with its own local frequency, and (2) blocks of different orientations sometimes have different clarities in the enhanced image, as shown in Fig. 1. This paper follows Hong's work [4] but uses a different filter shape and tunes the filter's frequency and size differently to overcome the blocky effects. To overcome the first kind of blocky effect, one could apply a filter with a fixed frequency to the entire image; however, split ridges may appear in the enhanced image, as shown in Fig. 3(c), if the filter's frequency is excessively large, and some ridges may disappear, as
shown in Fig. 3(d), if the filter's frequency is too small. Inter-ridge distances of different fingerprint images are distributed between 3 and 25 [4], and it is unsuitable to apply one fixed frequency to different images. In this work, we enhance an input fingerprint image using a Gabor filter with its frequency tuned to the image's average frequency, which avoids the first kind of blocky effect. Although different regions in a fingerprint image may have different frequencies, our experiments show that the difference between the local frequency of a block and the image's average frequency is small and does not result in ridge splitting or disappearance. The second kind of blocky effect results from the filter's square support: a square-shaped Gabor filter enhances blocks of different orientations to different clarities. To overcome this kind of blocky effect, we use a circular-support Gabor filter. This work differs from the HWJ method [4] mainly in the following: (1) the HWJ method applies different frequencies to different blocks and therefore shows blocky effects, whereas we apply the average frequency to the entire image to avoid this kind of blocky effect; (2) the HWJ method uses a square Gabor filter, whereas we use a circular Gabor filter to prevent the second kind of blocky effect described above; (3) the HWJ method fixes the Gabor filter size to 11x11, but a size-fixed filter cannot work well when the fingerprint image's average inter-ridge distance changes, and our experiments show that the Gabor filter works well when its size is three times the ridge width, so we adjust the filter size accordingly; (4) the HWJ method uses a normalization step before orientation field estimation, which our method drops. In the following sections we describe our enhancement algorithm in detail: Section 2 contains the main steps, Section 3 the experimental results, and Section 4 summarizes the algorithm.
Fig. 1. Enhancement using (a) square Gabor filter and local frequency, (b) circular Gabor filter and average frequency. (1) Local frequency effect. (2) Square filter effect. (3) No blocky effect.
2 Fingerprint Enhancement Let I denote a gray-level fingerprint image of size m x n, where I(x, y) is the intensity of the pixel at the x-th row and y-th column. Divide I into non-overlapping blocks of size w x w (15 x 15). Each block is denoted W(i, j), where (i, j) is the location of W; the center coordinate of W(i, j) is (i*w + w/2, j*w + w/2). Let M(i, j) mark the foreground/background of I: if M(i, j) = 1, W(i, j) is a foreground block, otherwise it is a background block. The orientation of each block W(i, j) is also estimated; several orientation estimation methods are reported in the literature [8, 9], and the orientation computation method used in our scheme is that of [9]. We use the block variance v(i, j) and its directionality g(i, j) to determine whether W(i, j) is a foreground block: when segmenting, blocks for which both v(i, j) and g(i, j) are large are marked as foreground, and the rest as background. The directionality g(i, j) is computed from the local gradient orientations within the block, as sketched below.
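A minimal sketch of the foreground/background decision described above. The paper's defining equation for g(i, j) was a display formula that is not reproduced here, so the gradient-coherence measure below is only a plausible stand-in, and the threshold values are assumptions.

    import numpy as np

    def block_features(block):
        # block: w x w gray-level block as a float array.
        b = block.astype(float)
        v = b.var()                                            # block variance v(i, j)
        gy, gx = np.gradient(b)
        gxx, gyy, gxy = np.sum(gx * gx), np.sum(gy * gy), np.sum(gx * gy)
        # gradient coherence in [0, 1], used here as a stand-in for g(i, j)
        g = np.sqrt((gxx - gyy) ** 2 + 4.0 * gxy ** 2) / (gxx + gyy + 1e-12)
        return v, g

    def is_foreground(block, v_thresh=150.0, g_thresh=0.3):    # thresholds are assumptions
        v, g = block_features(block)
        return v > v_thresh and g > g_thresh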
The orientation field computation and the image segmentation are followed by average frequency estimation and then filtering enhancement.
2.1 Average Frequency Estimation In order to estimate the average frequency of an input image, we first compute the local inter-ridge distance of each block W(i, j) with M(i, j) = 1, and then compute the average inter-ridge distance from these local estimates. Suppose (u, v) is the center of W(i, j); the estimation of its local inter-ridge distance can be described by the following steps:
(1) Create a coordinate system C with origin (u, v) and with its ordinate parallel to the block orientation.
(2) Choose a rectangular region R with center (u, v) and size L x H (33 x 15), with its side of length L parallel to the abscissa.
(3) Project R onto the abscissa to obtain a one-dimensional signal X[k], computed from the intensities of the pixels (x, y) of R in C.
(4) Smooth X[k] using a Gaussian filter.
(5) Compute the average inter-crest distance of the smoothed signal.
Here, the first three steps are similar to the HWJ algorithm [4]. Experiments show that the local period at minutiae and at the fingerprint edge is not stable and needs to be filtered. The filtering scheme is described as follows:
(1) Estimate the consistency between each block's local inter-ridge distance and those of its neighboring blocks. (2) Filter out the local estimates whose consistency falls below a threshold. We then obtain the average frequency f of the input image. Experiments show that, on the whole, f is stable across fingerprint images of different quality, and a satisfying enhancement result is obtained using f.
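The local inter-ridge distance of steps (1)-(5) can be sketched as follows; the window extraction, the Gaussian smoothing width, and the crest detection are assumptions filling in for the display formulas that are not reproduced above.

    import numpy as np
    from scipy.ndimage import gaussian_filter1d, rotate

    def block_interridge_distance(image, center, theta_deg, L=33, H=15):
        # center:    (row, col) center (u, v) of a foreground block.
        # theta_deg: block orientation in degrees.
        u, v = center
        half = max(L, H)
        window = image[max(0, u - half):u + half, max(0, v - half):v + half].astype(float)
        aligned = rotate(window, -theta_deg, reshape=False, order=1)   # align ridges vertically
        cy, cx = aligned.shape[0] // 2, aligned.shape[1] // 2
        region = aligned[cy - H // 2:cy + H // 2 + 1, cx - L // 2:cx + L // 2 + 1]
        x_signature = region.sum(axis=0)                    # projection onto the abscissa
        smoothed = gaussian_filter1d(x_signature, sigma=2.0)
        crests = [k for k in range(1, len(smoothed) - 1)
                  if smoothed[k] >= smoothed[k - 1] and smoothed[k] > smoothed[k + 1]]
        return float(np.mean(np.diff(crests))) if len(crests) >= 2 else None

    # After filtering the unstable local estimates, the average frequency is
    # f = 1 / (mean of the retained local inter-ridge distances).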
2.2 Fingerprint Enhancement A fingerprint consists of ridges with a local orientation and frequency. The HWJ algorithm [4] applies a Gabor filter to each block of an input fingerprint image, with the filter's orientation and frequency dynamically tuned to the local orientation and frequency. Different regions of a fingerprint image have different directions and frequencies, and experiments show that ridges in the enhanced image will sometimes not link smoothly as a result of applying different frequencies to different blocks, as shown in Fig. 4(b). In order to overcome this kind of blocky effect, we tune the filter's frequency to the average frequency of the input fingerprint image; in fact, the average frequency is close to the block local frequencies, and experiments show that applying the average frequency yields a well-enhanced image. Besides this blocky effect caused by applying a local frequency to each block, the Gabor filter used in the HWJ algorithm often enhances blocks with different directions to different clarities; this kind of blocky effect is due to the filter's square support. In order to remove this difference, we use a circular filter with its diameter adjusted to three times the ridge width, so that it just crosses one ridge and two valleys, as shown in Fig. 2. Experiments show that a circular Gabor filter enhances blocks with different orientations to the same clarity, and that tuning its diameter to three times the ridge width gives the best enhanced image. Suppose f = 1/T; the enhanced image is then obtained from I by filtering each foreground block with this circular Gabor filter, whose orientation is tuned to the block orientation, whose frequency is tuned to f, and whose space constants (the standard deviations of its Gaussian envelope) determine its circular support, as sketched below.
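A sketch of the circular Gabor filtering; the even-symmetric Gabor form is the one popularized by the HWJ algorithm [4], with equal space constants and a circular support of diameter three times the ridge period T = 1/f. The space-constant value used below is an assumption.

    import numpy as np

    def circular_gabor_kernel(theta, freq, diameter):
        # theta:    local ridge orientation (radians).
        # freq:     filter frequency, tuned to the image's average frequency f.
        # diameter: kernel size, set to three times the ridge period (3 * T).
        radius = diameter // 2
        y, x = np.mgrid[-radius:radius + 1, -radius:radius + 1]
        x_t = x * np.cos(theta) + y * np.sin(theta)
        y_t = -x * np.sin(theta) + y * np.cos(theta)
        delta = 0.5 / freq                                   # space constant ~ T / 2 (assumption)
        kernel = (np.exp(-(x_t ** 2 + y_t ** 2) / (2.0 * delta ** 2))
                  * np.cos(2.0 * np.pi * freq * x_t))
        kernel[x ** 2 + y ** 2 > radius ** 2] = 0.0          # restrict to a circular support
        return kernel

Each foreground pixel of the enhanced image is then obtained by correlating the input image with the kernel built for the local block orientation.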
An enhancement example using the above scheme is shown in Fig. 3(e).
Fig. 2. Gabor filter’s shape: (a)square; (b)circle.
Fig. 3. Fingerprint enhancement: (a) original image; (b) apply the local frequency on each block; (c) apply a fixed frequency bigger than the average frequency on each block; (d) apply a fixed frequency less than the average frequency on each block; (e) apply the average frequency on each block.
3 Experiments 3.1 Qualitative Analysis The HWJ enhancement scheme applies a square Gabor filter of fixed size to any fingerprint image. In order to show how the square Gabor filter and its fixed size affect the enhancement result, we made several sets of experiments, as shown in Fig. 5. There are 14 images, shown in Fig. 4, used in the experiments: 10 of them are from FVC2000 DB1 and 4 are collected using a SecuGen device. Two of the comparison approaches apply the average frequency to each block and use a size-fixed Gabor filter; one of them uses a square Gabor filter, while approach III uses a circular Gabor filter. In some enhanced images of the approach that applies the local frequency on each block, some ridges are not smoothly connected, even though the frequency field is extensively smoothed. In the enhanced images of the approaches that use a square filter, regions with a vertical or horizontal direction have a clearer ridge structure than other regions; this results from the filter's square shape. Different blocks in the enhanced image of approach III have an essentially uniform clarity. However, some enhanced images of approach III are not satisfying in that some ridges are disconnected; this effect also appears in the other size-fixed approach and results from the filter's size not suiting the inter-ridge distance of the input fingerprint image. Our method, shown in the last column, overcomes the above blocky effects.
Fig. 4. Images used in experiments
3.2 Quantitative Evaluation Here, we have implemented the direct gray-scale minutiae detection of [10], which we call the MM method, for comparison of minutiae detection. The comparison is carried out between the HWJ method, the MM method, and our method. When implementing the MM method, the parameter values are the same as in [10], except that [10] sets the section-length parameter to 7. In fact, this parameter should be related to the inter-ridge distance: for fingerprint images with a small inter-ridge distance, a section length of 7 would cover two to three ridges, which leads to jumping from one ridge to another when following ridge lines. Besides, the MM method has difficulty handling large ridge interruptions.
Fig. 5. Enhancement examples
[4] uses the Goodness Index (GI) for performance evaluation of minutiae detection; in its general form, the GI weights each block by its quality. Treating the quality of all blocks as the same, GI = (p - a - b)/t, where p is the total number of paired minutiae, a the total number of missing minutiae, b the total number of spurious minutiae, and t the total number of true minutiae. Because GI becomes negative when p < a + b, we define an Error Index; when treating all blocks as the same, EI = (a + b)/t. We use the 14 images of Section 3.1 to test minutiae detection and obtain the EI values shown in Table 1, which show that the performance of minutiae detection is improved.
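The simplified indices can be computed directly; here t is taken to be the total number of true (ground-truth) minutiae, as implied by the definition above.

    def goodness_and_error_index(paired, missing, spurious, total_true):
        # GI = (p - a - b) / t and EI = (a + b) / t, with all block qualities equal.
        gi = (paired - missing - spurious) / float(total_true)
        ei = (missing + spurious) / float(total_true)
        return gi, ei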
4 Conclusion This paper proposed a fingerprint enhancement scheme based on a Gabor filter whose frequency is tuned to the average frequency of the input image, whose shape is changed from square to circular, and whose size is dynamically adjusted based on the average frequency. This method can enhance a fingerprint image rapidly, overcomes the blocky effects, and improves the performance of minutiae detection. Although different blocks in a fingerprint image have different frequencies, experiments show that each block's frequency is close to the average frequency, so using the average frequency does not cause ridges to disappear or split. Both our scheme and the HWJ method spend about 250 ms enhancing an image of size 300x300 on an Intel Celeron 1 GHz PC. Our future work will mathematically analyze how the Gabor filter's size affects the enhanced image.
References [1] Lin Hong, Anil Jain, Sharath Pankanti, and Ruud Bolle. Fingerprint Enhancement. Proc. IEEE Workshop on Applications of Computer Vision, Sarasota, FL, pp. 202-207, Dec. 1996. [2] Shlomo Greenberg, Mayer Aladjem, Daniel Kogan, and Itshak Dimitrov. Fingerprint Image Enhancement Using Filtering Techniques. International Conference on Pattern Recognition (ICPR'00), Volume 3, September 2000.
[3] D. Simon-Zorita, J. Ortega-Garcia, S. Cruz-Llanas, J. L. Sanchez-Bote, and J. Glez-Rodriguez. An Improved Image Enhancement Scheme for Fingerprint Minutiae Extraction in Biometric Identification. Proceedings of the Third Audio- and Video-Based Biometric Person Authentication, Halmstad, Sweden, June 2001. [4] Lin Hong, Yifei Wan, and Anil Jain. Fingerprint Image Enhancement: Algorithm and Performance Evaluation. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 8, pp. 777-789, 1998. [5] K. Sasakawa, F. Isogai, and S. Ikebata. Personal verification system with high tolerance of poor quality fingerprints. Proc. SPIE, vol. 1386, pp. 265-272, 1990. [6] B. M. Mehtre. Fingerprint image analysis for automatic identification. Machine Vision and Applications, vol. 6, pp. 124-139, 1993. [7] O. Bergengruen. Preprocessing of poor quality fingerprint images. XIV Intl. Conf. of the Chilean Computer Science Society, October 1994. [8] Kalle Karu and Anil K. Jain. Fingerprint Classification. Pattern Recognition, vol. 29, no. 3, pp. 389-404, 1996. [9] Asker M. Bazen and Sabih H. Gerez. Systematic Methods for the Computation of the Directional Fields and Singular Points of Fingerprints. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 7, pp. 905-919, 2002. [10] D. Maio and D. Maltoni. Direct gray-scale minutiae detection in fingerprints. IEEE Trans. Pattern Anal. Machine Intell., vol. 19, no. 1, pp. 27-39, January 1997.
A Secure and Localizing Watermarking Technique for Image Authentication
Abdelkader H. Ouda and Mahmoud R. El-Sakka
Computer Science Department, University of Western Ontario, London, Ontario, Canada
{kader, elsakka}@csd.uwo.ca
Abstract. In this paper, a new block-based image-dependent watermarking technique is proposed. The proposed technique utilizes the correlation coefficient statistic to produce a short and unique representation (also known as hashed values or string-sequences) of the image data. These string-sequences are signed by an error-correcting-code signature scheme, which produces short and secure signatures. The image's least significant bits are utilized to embed these signatures. The used signature scheme requires string-sequences to be decodable syndromes. While the proposed correlation coefficient statistic function produces decodable syndrome string-sequences, most of the existing cryptographic hash functions do not. The results show that the proposed technique has an excellent localization property, where the resolution of the tracked tampered areas can be as small as 9x9 pixel blocks. In addition, the produced watermark has multi-level sensitivity that makes this technique well suited to the region-of-security-important approach, which increases the overall system performance.
1 Introduction The growing development of image processing, as well as the popularization of the Internet, has propelled image authentication issues to the forefront of the digital image field. Image authentication methods attempt to ensure the truthfulness of image content and its integrity. One of the best-known tools providing reasonable solutions to this issue is digital watermarking. Digital watermarking is a process in which signals, also known as watermarks, are embedded into digital data (images, video, or audio); these signals can be detected or extracted later to make an assertion about the data. Over the past few years, digital watermarking has received considerable attention from leading researchers around the world. Yeung et al. proposed a watermarking technique for image authentication [1]. In this technique the watermark is a binary image of the same size as the original image; this binary image is formed by tiling small binary images, such as a company logo, to cover the size of the original image. A key-based lookup table (LUT) is used in the embedding process: the LUT maps the original image pixels to the corresponding binary values in the binary image. In the verification process, every pixel of the image under question is tested by applying the same LUT to find the corresponding binary value. If the image is altered, the modified locations should appear in the extracted binary image. The advantage of this technique is that the authentication process is done on a pixel-by-pixel basis, and image alterations can be visually detected. However, the watermark is image-independent, which weakens the security of the system. Fridrich
et al. [2] showed that if the same logo and key are reused for at least two images, it becomes very easy to accurately estimate the LUT. Hence, they proposed a solution replacing the LUT with a public-key encryption scheme. The proposed modification is, however, computationally expensive because the encryption must be done at each pixel; therefore, in practice this scheme cannot be widely implemented. In 1998, Wong [3] proposed a block-based watermarking technique for image integrity verification, where the block size used was 8x8 pixels. In 2000, Wong and Memon [4] republished this technique with some variations that made it resist the Vector Quantization watermark attack [5]. One year later, the same authors recommended using an image block of size 12x12 [6] in order to hold the full length of the output of the MD5 hash function [7]. The watermark in this technique is constructed, as in Yeung's technique [1], from a tiled small binary image. The original image is scanned block by block, and each block is hashed together with some image information. These hash values are then combined with the binary image using the binary XOR operation; the output is encrypted, and the produced ciphertext is embedded into the corresponding blocks. Recently, Ouda et al. [9] showed that this technique suffers from a serious security leak. The main reason for this leak is that the authors assumed that the plaintext size determines the ciphertext size in the signing process; this is not always a true assumption, since it is in fact the secret key size that determines the ciphertext size. They also proposed a solution for this leak in which a larger image block, 32x32 pixels, is used while the detection accuracy is kept almost the same as in the original technique. In [10], Fridrich proposed another block-based image authentication technique based on the main idea of Wong's technique [3]. The main contribution of this technique is a new solution to resist the Vector Quantization attack [5]: instead of using a fixed binary image, as in Wong's technique, 8x16 binary blocks are created and concatenated to form the binary image. These blocks are formed such that they are easily recognizable and hold some information about the corresponding image block, such as the block index, the image index, and the author ID. In fact, the proposed solution also suffers from the same problem of Wong's technique mentioned above; moreover, the image block size of 8x16 pixels is not enough to hold a secure signature, e.g., 1024 bits. Over the last few years, many image watermarking techniques have been proposed, yet these techniques have been broken [1,3,4,6,10] (i.e., proven to be cryptographically insecure), or have proved to be impractical [2,11,12]. In this paper, a practical and secure watermarking technique is proposed. The paper is organized as follows. In Section 2, the general framework of the proposed technique is presented, while detailed descriptions of the main components are given in Sections 3, 4, and 5. Experimental results come in Section 6. The conclusion is offered in Section 7.
2 System Framework The proposed system framework includes two main processes, namely:
- the watermark generation and embedding process, and
- the watermark extraction and authentication process.
Fig. 1. The framework of the watermark generation and embedding process.
Each of these processes consists of several other units. In this section the main idea of each unit is demonstrated; the detailed descriptions are given in Sections 3, 4 and 5.
2.1 Watermark Generation and Embedding Process The watermark generation and embedding process consists of four main units (see Fig. 1). The image divider unit is responsible for dividing an image into small non-overlapping blocks; the image and sub-block dimensions are given to this unit as input. Section 3 shows how the image divider unit deals with an image to fulfill the demands of the region-of-security-important (ROSI) approach. The main function of the string-sequence generator unit is to produce a hashed value for each image block (see Section 4). These hashed values are generated such that they become valid (decodable) syndromes, which are used in the digital signature unit to sign the image blocks using the image private key. The block signature unit utilizes a digital signature algorithm based on an error-correcting-code cryptosystem; the length of the ciphertext of each signature is as small as 81 bits. Finally, the embedding unit inserts each signature (ciphertext) into the LSBs of the corresponding image block, using the first 81 bits only, to produce the watermarked image. Note that the image block size might be, in some portions of the image, larger than 9x9 pixels; in this case the image divider unit sets the first 81 LSBs in these blocks to zero and leaves the rest untouched. Once the watermark (the block signatures) is embedded into the image, the image can be securely distributed. 2.2 Watermark Extraction and Authentication Process The watermark extraction and authentication process is responsible for extracting and verifying the signature of the image under question, which was originally signed by the watermark generation process. Fig. 2 illustrates the five main units of the watermark extraction and authentication process. The image divider unit divides the image into small blocks with the same sizes as in the embedding process. The watermark extraction unit extracts the block signatures from the LSBs of the image. The decryption unit recovers the string-sequences corresponding to each image block, using the image owner's public key. At the same time, the string-sequence generator unit does the same job as in the embedding process to generate a string-sequence corresponding to each image block. Finally, the sequence comparison unit will test
Fig. 2. The framework of the watermark extraction and authentication process.
and compare each pair of the extracted and generated string-sequences. By testing all these pairs, the sequence comparison unit produces a binary image with which the image can be authenticated visually. Note that all information needed during the extraction and verification process has already been embedded into the image, and there is no need to have the original image or watermark during this process.
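The embedding and extraction units described above can be sketched as follows, writing the 81 signature bits into the least significant bits of the first 81 pixels of each block; the raster ordering of the pixels is an assumption.

    import numpy as np

    def embed_signature(block, signature_bits):
        # block:          2-D uint8 array with at least 81 pixels (modified in place).
        # signature_bits: sequence of 81 values in {0, 1}.
        for k, bit in enumerate(signature_bits[:81]):
            block.flat[k] = (int(block.flat[k]) & 0xFE) | bit
        return block

    def extract_signature(block):
        # Read back the 81 embedded least significant bits.
        return [int(p) & 1 for p in block.flat[:81]]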
3 Image Divider Unit Typically, not all objects in an image have the same value, and accordingly they do not need the same level of protection. There are many examples of such images; if we look at the cheque image illustrated in Fig. 3, we observe that the courtesy amount area is very important and is highly targeted by attackers: a small modification of the contents of this area would greatly change the cheque image, whereas modifying some part of the background would not make that much difference. This is a typical scenario of the so-called region-of-security-important (ROSI) approach. The image divider unit helps provide a practical solution to the above problem. When an image is divided into small blocks, this unit takes into consideration factors such as the significant parts of the image, the quality (resolution) required for these areas, the overall cost, and the system performance. For instance, in the cheque image in Fig. 4, the courtesy amount area may be divided into tiny blocks (as small as 9x9 pixels) to recognize small changes. The signature area of the cheque is also important, but it is usually larger than the courtesy amount area, so it might be divided into 12x12-pixel blocks. The remaining areas in the cheque image are divided into bigger sub-blocks, whose dimensions are based on the image size and the computational resources available. The image divider unit is also responsible for setting the LSBs of the first 81 pixels of each image block to zero in order to reserve these positions for the generated watermark.
4 String-Sequence Generation The string-sequence generator unit plays the main role in the proposed technique. This unit utilizes the correlation coefficient statistic to produce a small and unique
Fig. 3. A cheque image having multi-level of security importance.
Fig. 4. The image is divided into sub-blocks of different sizes based on ROSI, where the cheque courtesy amount area is divided into 9x9-pixel blocks, the signature area is divided into 12x12-pixel blocks, and the remaining parts are divided into 128x64-pixel blocks.
representation for a given image or any sub-block within it. Correlation coefficients are a useful and potentially powerful tool that statistically measures the relationship between two sets of variables, e.g., two adjacent columns or two adjacent rows in a given image. The relationships of the image-block pixels with regard to their neighbors are measured and combined together to produce one value called the string-sequence. The correlation coefficient between any two adjacent rows, or columns, A and B, can be calculated using Eq. (1),

r(A, B) = sum_i (a_i - abar)(b_i - bbar) / sqrt( sum_i (a_i - abar)^2 * sum_i (b_i - bbar)^2 )     (1)

where a_i and b_i are two pixel values located in the same row at two adjacent columns, or in the same column at two adjacent rows, and abar and bbar are the averages of the n values a_i and b_i respectively. Fig. 5 illustrates the row/column-wise correlation coefficient calculation. It shows how the correlation coefficients preserve the relationship between a pixel and its 4-connected neighborhood: each pixel is paired with its left and right neighbors through the column-wise coefficients, and with its upper and lower neighbors through the row-wise coefficients.
The string-sequence for a given m x n block is calculated as follows:
1. Calculate the n-1 column-wise correlation coefficients using Eq. (1), for i = 1, ..., n-1.
2. Calculate the m-1 row-wise correlation coefficients using Eq. (1), for j = 1, ..., m-1.
3. Compute the value v using Eq. (2), i.e., the summation of all the coefficient values produced by the above two steps.
4. Calculate the average w of the image block.
5. Calculate the string-sequence s using Eq. (3), where the integer parameter of Eq. (3) is the smallest integer for which a vector z can be found such that the resulting string-sequence is a decodable syndrome (see Section 5).
Fig. 5. Pixel representation of an m x n image block, showing the column-wise and row-wise correlation coefficients.
Note that ddp (which stands for Drop Decimal Point) is a function that drops the decimal point from a real number and makes it an integer; for example, if v = 23.65908665321087 then ddp(v) = 2365908665321087. Note also that, in this work, v and w are double-precision variables with a 52-bit mantissa, and hence the string-sequence can be any positive integer bounded to 16 digits in length. From the definition of the correlation coefficients, in some special cases the output of Eq. (3) may be the same for two different image blocks. To avoid these cases, the following transformations are made to the image data before applying Eq. (3). Case 1: the pixel values are paired such that high values are paired with relatively high values, and low values with relatively low values, within a specific ratio. For example, two different blocks with this property can produce the same value of v when Eq. (2) is applied (both equal to 4 in the example used in our experiments) and the same block average, so that after applying Eq. (3) their string-sequences are both equal to 50765641.
Solution: a dummy value is added to each row and each column, with these values taken in increasing order, e.g., 1, 2, ..., n (mod 256); the modulus is taken in order to ensure that these values always stay within the range 0 to 255. After this augmentation, Block 1 has v = 15.6067501197491, w = 71.25, and string-sequence 156067551963116, while Block 2 has v = 15.4975646811457, w = 71.25, and string-sequence 154975697577082, so the collision is removed. Case 2: the two image blocks have the same pixel values but not in the same positions, for example a block that is rotated or flipped. In the example considered, two such 4x4 blocks containing only the pixel values 120 and 170 both yield v = 0.333333333 and block average 145, and hence the same string-sequence, 1111111111132136.
Where is the transformation of the pixel at position i in a given row or column, and n is the length of the block row (row transformation), or the length of the block column (column transformation). For example consider the two blocks mentioned above (case 2), therefore they will be transformed to the following blocks (row transformation): Block 1: 183 247 105 119 Block 2: 233 247 105 169 183 247 55 169 233 41 55 119 183 41 55 119 183 41 105 119 183 41 105 169 183 41 105 169 After this transformation, block 1 has: v = 2.14609511595264, w= 128, and the string-sequence = 46057242483542, and Block 2 has: v = 3.10488641778782, w = 128, and the string-sequence = 964031966757064.
5 Image Block Signature Unit The image block signature unit adopts an error-correcting-code-based digital signature scheme to sign the string-sequences and produce the image watermark. This scheme was proposed by Courtois et al. [13]; please refer to the articles in [14-18] for its theoretical background. The Courtois digital signature scheme gives short signatures of 81 bits with a security strength based on the difficulty of syndrome decoding, which was proven to be NP-complete [14]. This digital signature scheme is
based on Niederreiter’s cryptosystem [17] with public key a scrambled paritycheck matrix of a binary Goppa code. The signature of an image data is based on the idea that we search for the first random decodable syndrome s, such that we can find a vector z satisfying
The signature will be the vector z. The probability to find a random decodable syndrome, using the Goppa code, is 1/9!. The string-sequence generation unit is designed to provide such syndromes (see Section 4).
6 System Tests and Results Two main experiments were conducted on a database of 680 different images. These images are scanned at a resolution of 200 dots/inch. The produced images are 1274x552 pixels each. The first experiment assessed the collision resistance of the string-sequences, whereas the second experiment tested the altering location detection property.
6.1 Collision Resistance Experiment The string-sequences are called collision resistant if it is hard to find two different image blocks having the same string-sequence. To test this property, each image in the database is divided into small non-overlapping blocks. The sizes of these blocks are chosen to be 512x512, 256x256, 128x128, 64x64, 32x32, 16x16, 12x12, and 9x9. Note that, in some cases, the blocks cannot completely tile the entire image; in such cases the block is wrapped around the image boundary. The string-sequence is generated for each block in a given image, and the smallest difference between any two string-sequences is calculated; if this difference is greater than zero, there is no collision. The results of this experiment are summarized in Table 1. Each row shows the image block size used, the average of the string-sequences among all blocks, and the minimum difference of the string-sequences. The smallest number in the third column of Table 1 is 1065479, which is the smallest difference between any two string-sequences of a given image. We conclude that each image block can be
Fig. 6. (a) The original image, (b) The watermarked image, (c) some modification in the image pixels, (d) the produced binary image after editing modifications.
represented by a unique string-sequence, even in a block size 9x9. The length of the string-sequences varies based on the image block sizes. The average of the differences of the image string-sequences in Table 1 shows how far the string-sequence is from collision.
6.2 Altering Location-Detection Property When a watermarking scheme is able to identify a modified pixel region in a given image, it satisfies the altering location-detection property. To test this property, the image in Fig. 6(a) is used as an original image that needs to be protected. The produced watermarked image is shown in Fig. 6(b). The watermarked image is modified as shown in Fig. 6(c), and the extraction process produced the binary image shown in Fig. 6(d). This image shows that the modified areas have been successfully identified and located.
7 Conclusion In this paper, a new block-based image-dependent watermarking technique is proposed. In this technique a correlation coefficient statistic is utilized to produce a small and unique representation (string-sequence) for a given image or any sub-block within it. These string-sequences are generated such that they are easily converted into decodable syndromes, and an error-correcting-code digital signature scheme is used to sign the image data. Experimental results showed that the produced string-sequences are collision resistant; more precisely, even if, after an exhaustive search of the string-sequences, a collision occurred, the two input images would differ only in ways that the human eye cannot distinguish, because the string-sequence is produced by an image-dependent hashing function. The experiments also showed that the performance of the proposed technique, both in terms of cryptographic security and of the localization property, is superior to other counterparts available today.
Acknowledgement. This work was partially supported by the Ontario Graduate Scholarship (OGS); this support is greatly appreciated. Special thanks go to Nicolas Sendrier and Matthieu Finiasz for their support in testing their signature scheme.
References
1. M. Yeung and F. Mintzer: An Invisible Watermarking Technique for Image Verification, IEEE International Conference on Image Processing, vol. 2, pp. 680-683, 1997.
2. J. Fridrich, M. Goljan, and N. Memon: Further attacks on the Yeung-Mintzer fragile watermark, SPIE Photonics West, Electronic Imaging 2001, Security and Watermarking of Multimedia Contents II, vol. 3971, pp. 428-437, 2000.
3. P. Wong: A Public Key Watermark for Image Verification and Authentication, IEEE International Conference on Image Processing, vol. 1, pp. 455-459, 1998.
4. N. Memon and P. Wong: Secret and Public Key Authentication Watermarking Schemes that Resist Vector Quantization Attack, SPIE International Conference on Security and Watermarking of Multimedia Contents II, vol. 3971, pp. 417-427, 2000.
5. M. Holliman and N. Memon: Counterfeiting attacks on oblivious block-wise independent invisible watermarking schemes, IEEE Transactions on Image Processing, vol. 9, no. 3, pp. 432-441, 2000.
6. P. Wong and N. Memon: Secret and Public Key Image Watermarking Schemes for Image Authentication and Ownership Verification, IEEE Transactions on Image Processing, vol. 10, no. 10, pp. 1593-1601, 2001.
7. R. Rivest: The MD5 message digest algorithm, Technical Report RFC 1321, Internet Engineering Task Force, 1992.
8. R. Rivest, A. Shamir, and L. Adleman: A method for obtaining digital signatures and public-key cryptosystems, Communications of the ACM, vol. 21, no. 2, pp. 120-126, 1978.
9. A. Ouda and M. El-Sakka: Technical report on methods to correct the Wong-Memon image watermarking scheme, London, Ontario, University of Western Ontario, Allyn and Betty Taylor Library, no. QA76.5.L653 no. 603, 2003.
10. J. Fridrich: Security of Fragile Authentication Watermarks with Localization, SPIE Photonics West, Electronic Imaging 2002, Security and Watermarking of Multimedia Contents, vol. 4675, pp. 691-700, 2002.
11. J. Fridrich, M. Goljan, and A. Baldoza: New Fragile Authentication Watermark for Images, IEEE International Conference on Image Processing, vol. 1, pp. 446-449, 2000.
12. M. Costa: Writing on dirty paper, IEEE Transactions on Information Theory, vol. 29, no. 3, pp. 439-441, 1983.
13. N. Courtois, M. Finiasz, and N. Sendrier: How to achieve a McEliece-based digital signature scheme, Advances in Cryptology - ASIACRYPT 2001, LNCS 2248, pp. 157-174, Springer-Verlag, 2001.
14. E. Berlekamp, R. McEliece, and H. van Tilborg: On the inherent intractability of certain coding problems, IEEE Transactions on Information Theory, vol. 24, no. 3, pp. 384-386, 1978.
15. R. McEliece: A public-key cryptosystem based on algebraic coding theory, Jet Propulsion Lab. DSN Progress Report, 1978.
16. R. Deng, Y. Li, and X. Wang: On the equivalence of McEliece's and Niederreiter's public key cryptosystems, IEEE Transactions on Information Theory, vol. 40, no. 1, pp. 271–273, 1994.
17. H. Niederreiter: Knapsack-type cryptosystems and algebraic coding theory, Problems of Control and Information Theory, vol. 15, pp. 159–166, 1986.
18. T. Cover: Enumerative source encoding, IEEE Transactions on Information Theory, vol. 19, pp. 73–77, 1973.
A Hardware Implementation of Fingerprint Verification for Secure Biometric Authentication Systems

Yongwha Chung 1, Daesung Moon 2, Sung Bum Pan 2, Min Kim 3, and Kichul Kim 3

1 Department of Computer and Information Science, Korea University, Chochiwon, KOREA
  [email protected]
2 Biometrics Technology Research Team, ETRI, Daejeon, KOREA
  {daesung, sbpan}@etri.re.kr
3 Dept. of Electrical & Computer Eng., University of Seoul, Seoul, KOREA
  {minkim, kkim}@uos.ac.kr
Abstract. Using biometrics to authenticate a person's identity has several advantages over the present practices of Personal Identification Numbers or passwords. To gain maximum security in an authentication system using biometrics, the computation of the authentication as well as the storage of the biometric template has to take place in a secure device such as a smart card. However, it is challenging to integrate biometrics into secure devices because of their limited resources (processing power and memory space). In this paper, we propose an area-time-accuracy efficient hardware design for a fingerprint matching system, which can be integrated into typical smart card chips. Experimental results show that the match operation can be completed in real time (190 ms) on the proposed hardware, and an Equal Error Rate (EER) of 3.8% can be obtained. The hardware uses only 36K gates of silicon area in CMOS technology and can be easily integrated into smart cards. In this scheme, all the critical information, including the biometric template, can be encapsulated within the smart card, removing the risk of any data leaking out. Keywords: biometrics, fingerprint verification, Match-on-Card
1 Introduction

In the modern electronic world, the authentication of a person is an important task. Using biometrics to authenticate a person's identity has several advantages over the present practices of Personal Identification Numbers (PINs), passwords, and smart cards, which can be lost, forgotten, or stolen. In typical biometric authentication systems, the biometric templates are often stored in a central database. With central storage of the biometric templates, there are open issues of misuse of the templates, such as the 'Big Brother' problem. To solve these open issues, the database can be distributed to millions of smart cards [1-6]. Most current implementations of this solution share a common characteristic: the biometric authentication process is accomplished outside the smart card, introducing the risk of leaking out
biometric information. To heighten the security level, the authentication operation needs to be performed within the smart card, not in an external card reader. This approach is called the Match-on-Card method. Note that standard PCs, on which typical biometric verification systems operate, have a 2 GHz CPU and 256 MB of memory. On the contrary, a state-of-the-art smart card chip (5 mm × 5 mm) can employ at most a 32-bit ARM7 CPU, 256 KB of ROM program memory, 72 KB of EEPROM data memory, and 8 KB of RAM. Since such a smart card chip has very limited memory, typical biometric verification algorithms may not execute successfully even on a state-of-the-art smart card. To reduce the required memory space significantly, we previously developed a memory-efficient fingerprint matching algorithm that performs more computation with a special data structure [6]. However, the recognition rate of that algorithm is low and needs to be improved further. In this paper, we present a hardware architecture for a fingerprint matching system that can be implemented as a Match-on-Card with a high recognition rate. To the best of our knowledge, no previous work has reported a hardware-based fingerprint matcher embedded into a smart card chip. The rest of the paper is structured as follows. Section 2 explains a fingerprint authentication system for Match-on-Card. Section 3 describes the hardware architecture of the minutiae matching system, and Section 4 shows experimental results. Conclusions are given in Section 5.
2 Fingerprint Authentication System for Match-on-Card

A fingerprint authentication system has two phases: enrollment and verification. In the off-line enrollment phase, an enrolled fingerprint image is preprocessed, and the salient features derived from the fingerprint ridges, called minutiae, are extracted and stored. In the on-line verification phase, the similarity between the enrolled minutiae pattern and the input minutiae pattern is examined. Note that we use a typical representation of the minutiae, namely the x and y coordinates and the angles of the ridge endings and bifurcations in the fingerprint image [7]. Fig. 1 shows a fingerprint authentication system for Match-on-Card. Preprocessing refers to refining the fingerprint image delivered by a fingerprint scanner to compensate for image distortion. Extraction refers to the extraction of features from the fingerprint image. After these steps are performed on the card reader, some minutiae are detected and stored in the smart card as a template file. The template file includes the position, orientation, and type (ridge ending or bifurcation) of the minutiae. Then, the similarity between the Enrolled Minutiae pattern and the Input Minutiae pattern is examined in the Match step. Note that the Match is performed in the smart card. Thus, it can provide greater privacy to the user because the biometric data stays in the secure environment of the smart card. Details of the fingerprint Match-on-Card can be found in [6].
Fig. 1. Fingerprint Authentication System for Match-on-Card
3 Minutiae Matching Hardware

3.1 Details of Fingerprint Matching

Fingerprint matching consists of an alignment stage and a matching stage. Input to the alignment stage includes a set of enrolled minutiae P and another set of minutiae Q extracted from the input fingerprint image. Q is transformed with a similarity transformation (rotation and translation) in the alignment stage; the result of the similarity transformation is a set of minutiae R. Input to the matching stage includes the enrolled minutiae P and the minutiae R transformed from Q in the alignment stage. In the matching stage, R is compared with P and a matching score is computed. Equation (1) shows the notation for the enrolled minutiae P, the minutiae Q extracted from the input fingerprint image, and the minutiae R transformed from Q.
In Equation (1), the components of each minutia represent its spatial position, orientation, and type, for each of the sets P, Q, and R, respectively. The numbers of minutiae in P and Q are m and n, respectively. The translation and rotation parameters define the similarity transformation; R is obtained by applying them to Q and therefore also has n minutiae.
In the alignment stage, the selection of reference points and the similarity transformation are performed. During the selection of reference points, two points are selected arbitrarily, one from each of P and Q. The selected points are called reference points. During the transformation process, Q is transformed to R according to Equation (3), whose transformation parameters are determined from the two reference points as shown in Equation (2).
The alignment stage is repeated for every possible pair of reference points. That is, Q is transformed according to every possible combination of reference points; P is compared with every transformed set and, each time, the number of matching points is counted. The matching score is decided by the maximum number of matching points over all alignments.
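For illustration, the following is a minimal Python sketch of the brute-force alignment-and-matching loop described above. The minutia representation as a tuple (x, y, angle, type) follows the description in Section 2, but the tolerance values and all function names are assumptions for illustration, not the authors' implementation.

```python
import math

# A minutia is a tuple (x, y, theta, mtype); the tolerances below are
# illustrative assumptions, not the values used in the paper.
POS_TOL, ANG_TOL = 8.0, math.radians(15)

def transform(q, ref_p, ref_q):
    """Align minutia q of Q onto P using one reference pair (rotation + translation)."""
    dtheta = ref_p[2] - ref_q[2]
    cos_t, sin_t = math.cos(dtheta), math.sin(dtheta)
    dx, dy = q[0] - ref_q[0], q[1] - ref_q[1]
    x = ref_p[0] + cos_t * dx - sin_t * dy
    y = ref_p[1] + sin_t * dx + cos_t * dy
    return (x, y, (q[2] + dtheta) % (2 * math.pi), q[3])

def count_matches(P, R):
    """Count minutiae of R that fall close to some minutia of P (position, angle, type)."""
    used, count = set(), 0
    for r in R:
        for i, p in enumerate(P):
            if i in used:
                continue
            ang_diff = abs((p[2] - r[2] + math.pi) % (2 * math.pi) - math.pi)
            if (math.hypot(p[0] - r[0], p[1] - r[1]) <= POS_TOL
                    and ang_diff <= ANG_TOL and p[3] == r[3]):
                used.add(i)
                count += 1
                break
    return count

def matching_score(P, Q):
    """Try every reference-point pair and keep the best number of matching points."""
    best = 0
    for ref_p in P:
        for ref_q in Q:
            R = [transform(q, ref_p, ref_q) for q in Q]
            best = max(best, count_matches(P, R))
    return best
```

The quadratic number of alignments in this sketch is exactly what motivates the dedicated PREPROCESSING/TRANSFORM/COMPARISON hardware described next.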
3.2 Hardware Architecture

Fig. 2 shows the hardware structure of the minutiae matching system, which consists of a REGISTER FILE and a CORE. The REGISTER FILE contains registers that communicate with the microprocessor through the system bus. Because of its popularity, AMBA AHB has been used as the system bus [8]. The FILE and INPUT registers in the REGISTER FILE store the data of the Enrolled Minutiae and the Input Minutiae, respectively.
Fig. 2. Hardware Architecture of Minutiae Matching System
The PARAMETER registers provide the parameter values needed for the control of the hardware, and the RESULT registers contain the result of the match operation. The CORE is composed of three modules: PREPROCESSING, TRANSFORM, and COMPARISON. The PREPROCESSING module selects reference points and generates parameters for the similarity transformation according to the selected reference points. The TRANSFORM module performs the similarity transformation for the selected reference points. The COMPARISON module computes a matching score. The hardware structure of each module is shown in Fig. 3. The basic components of the PREPROCESSING module are multiplexers, adders, and comparison logic. The PREPROCESSING module performs two tasks. First, it selects reference points in the Enrolled Minutiae and Input Minutiae and computes the difference between the reference points to decide whether to proceed with the current reference points. If the difference between the reference points exceeds a predefined threshold, the remaining processing of the alignment stage and the matching stage is not performed. Second, it computes the differences between the Input Minutiae and the reference minutia that are required in the transformation process of the alignment stage. The TRANSFORM module consists of adders, multipliers, and a ROM that stores trigonometric function coefficients. Because this module performs the most time-consuming computation of the minutiae matching, four multipliers are employed in parallel. Furthermore, storing the trigonometric function coefficients in the ROM shortens the processing time. The TRANSFORM module transforms the Input Minutiae into the Transformed Minutiae. The COMPARISON module consists of adders, a counter, registers, and logic components for comparison. A matching score is computed by comparing the positions, orientations, and types of the Enrolled Minutiae and the Transformed Minutiae. The maximum value of the matching score is stored in the RESULT register. Note that a pipeline methodology can be applied to improve the speed of the minutiae matching further: a 2-stage pipeline can be obtained by grouping PREPROCESSING and TRANSFORM together as the first stage and making COMPARISON the second stage. Further speedup is also possible by employing more comparison logic components in parallel in the COMPARISON module. These features are not used in the current implementation since they consume more resources.
4 Experimental Results

The minutiae matching system was implemented on a Xilinx Virtex-II (XC2V2000), and the AMBA Advanced High-performance Bus was used so that it can be integrated into smart card chips easily. Table 1 shows the characteristics of the hardware. Given 30 minutiae extracted from a fingerprint image, the matching with another 30 minutiae stored in a smart card can be completed in 190 ms using a small area (36K gates).
Fig. 3. Structure of Three Modules in CORE
Table 2 shows the comparison of the EER (Equal Error Rate) between various approaches. Note that the area estimations are based on the CMOS technology used in typical smart card chips. A specially designed, software-based approach to reduce the RAM space [4] can perform the same operation in 900 ms by using 6.8 KB of RAM; however, its error rate (6.0%) is higher than that of the hardware design (3.8%). Note that the typical software-based approach cannot perform the same operation in a smart card chip, whose area is restricted with current technology, because it requires 300 KB of RAM.
5 Conclusions

In this paper, we have presented a hardware solution to the fingerprint minutiae pattern matching problem. The hardware can be easily integrated into an area-constrained smart card environment, removing the possibility of leaking biometric information. To evaluate the effectiveness of the proposed hardware design, we compared it with software-based approaches in terms of area, time, and error rate. Compared to the specially designed, software-based approach, the proposed hardware design utilizes its resource (area) more efficiently and completes the matching computation more quickly with a lower error rate, by a factor of 14.
References
[1] H. Dreifus and T. Monk, Smart Cards, John Wiley & Sons, 1997.
[2] G. Hachez, F. Koeune, and J. Quisquater, "Biometrics, Access Control, Smart Cards: A Not So Simple Combination," in Proc. 4th Working Conf. on Smart Card Research and Advanced Applications, pp. 273-288, 2000.
[3] R. Sanchez-Reillo, "Smart Card Information and Operations using Biometrics," IEEE AEES Mag., pp. 3-6, 2001.
[4] N. Kaku, T. Murayama, S. Yamamoto, "Fingerprint Authentication System for Smart Cards," Proc. of IFIP on E-commerce, E-business, E-government, pp. 97-112, 2001.
[5] L. Bechelli, S. Bistarelli, and A. Vaccarelli, "Biometrics Authentication with Smart Card," IIT Technical Report, 2002.
[6] S. Pan, et al., "A Memory-Efficient Fingerprint Verification Algorithm using a Multi-Resolution Accumulator Array for Match-on-Card," ETRI Journal, Vol. 25, No. 3, pp. 179-186, 2003.
[7] A. Jain, L. Hong, and R. Bolle, "On-line Fingerprint Verification," IEEE Trans. on Pattern Analysis and Machine Intelligence, Vol. 19, No. 4, pp. 302-313, 1997.
[8] ARM Limited, AMBA Specification, Rev. 2.0, 1999.
Inter-frame Differential Energy Video Watermarking Algorithm Based on Compressed Domain

Lijun Wang 1, Hongxun Yao 1, Shaohui Liu 1, Wen Gao 1,2, and Yazhou Liu 1

1 Department of Computer Science, Harbin Institute of Technology, Harbin, P.R. China, 150001
  {ljwang,yhx,shaol,yzliu}@vilab.hit.edu.cn
2 Institute of Computing Technology, Chinese Academy of Sciences
  [email protected]
Abstract. This paper presents a novel video watermarking scheme based on the correlation between adjacent frames in the MPEG-4 video bitstream. The watermark is embedded in P-frames during encoding and retrieved directly from the compressed bitstream without the original video. In the scheme, the macroblock (MB) coding modes provided by the encoder are used to decide the embedding locations, an energy mapping function guarantees the visual quality, and re-embedding together with an optimal retrieving strategy guarantees the synchronization between embedding and retrieving. The experimental results indicate that the scheme has several advantages, including low computational complexity, strong robustness to re-encoding, low visual degradation, and a high watermark payload in a low bit-rate environment, compared with other watermarking systems in the compressed domain.
1 Introduction

With the popular dissemination of digital products like video and audio, copyright protection has become a key problem. Video watermarking is an efficient technology to protect the copyrights of digital video. Many video watermarking schemes have been proposed, and they can be classified into three classes according to the embedding domain. In the first class, the watermark is embedded in the spatial domain. Earlier methods embedded the watermark in the Least Significant Bits (LSB) of image pixels [1], but this is not resistant to attacks such as filtering and compression. In the second class, the watermark is embedded in the frequency domain; many schemes embed the watermark in the DCT [2-3] or DWT [4] coefficients. These watermarks are more robust than those of the first class, but they require high computational complexity. In the third class, the watermark is embedded in the compressed domain. A typical scheme embeds the watermark in the VLC, as proposed by Cross [5]; other schemes embed the watermark in the residual of motion vectors by modifying the motion vectors [6]. These video watermarking schemes can achieve low computational complexity or improve robustness to compression, but they are not resistant to attacks such as re-encoding and frame dropping. Lagendijk [7]
has developed an algorithm called Extended Differential Energy Watermarking (XDEW) in which the watermark is embedded in both I-frames and P-frames. The algorithm operates in a low bit-rate environment and achieves good robustness to re-encoding, but it is complicated from a computational standpoint. Compared with the XDEW scheme, this paper proposes a new video watermarking scheme, IFDE (inter-frame differential energy), for the standard MPEG-4 video bitstream, based on the correlation between adjacent frames. In this scheme, embedding is performed during video compression and the watermark is retrieved from the bitstream. The most remarkable characteristic of the algorithm is that the watermark energy is regarded as an indispensable part of the video data, so it achieves good performance in terms of visual quality and robustness to compression. The rest of this paper is organized as follows. In Section 2, the IFDE algorithm is explained in detail. In Section 3, the robustness of the watermark is analyzed. In Section 4, the experimental results are presented. Finally, we present the conclusions of our experiments in Section 5.
2 IFDE Algorithm

P- and B-frames are coded after motion prediction and compensation during encoding, so they have lower energy content compared with I-frames. But they take up the majority of the video bitstream, since there is only one I-frame in a Group of Pictures. If we embed the watermark in this part of the space with an elaborate technique, a high watermark payload can be obtained in a low bit-rate environment.
2.1 The Position of Embedding the Watermark

Digital video watermarking is required to be blind, since video is a medium with a large data space. We develop a blind video watermarking algorithm based on the high correlation between adjacent frames. During the encoding of P- or B-frames, the residual of a block is the difference between the block in the current frame and its predicted block in the reference frame. If we adjust the residual to match the watermark bitstream within a required fidelity, we can realize blind watermarking. There are several different MB coding modes in P-frames, such as INTER16, INTER8, and INTRA, and the performance of watermarking differs depending on the kind of MB used for embedding. We select the luma blocks of INTER16 MBs as the basic unit for embedding the watermark, because all sub-blocks (4 luma blocks) of an INTER16 MB have the same motion vector and they cause less drift during re-encoding at the same bit rate. The experimental results prove this point.
2.2 The Embedding Process

The embedding procedure is performed as follows:
1. For a luma block of an INTER16 MB, M quantized DCT coefficients within a limited range are selected before a Cut Off (cut_off) point in
zigzag-scanning order; these constitute the embedded space. If fewer than M coefficients can be selected, the block is not embedded.
2. The watermark can be any useful information; in our experiments it is a random sequence of 1s and –1s generated by a key. Corresponding to the M selected coefficients, the watermark sequence is divided into several Groups of Watermark bits (GOW) of size M. Every selected coefficient corresponds to one watermark bit, and we use the Energy Mapping Function (formula 1) to embed the watermark.
Here x is the value of one coefficient, abs(x) represents the absolute value of x, w is one watermark bit, a and b are the bounds of the embedding interval, and c is the partition point (in our experiments, a, b and c are 2, 3 and 2). As the formula shows, the value of the coefficient represents the energy of the watermark bit. From this view, our scheme is named IFDE, meaning the differential energy between the current and predicted MB.
3. The same GOW is embedded in the 4 luma blocks of one INTER16 MB, if these blocks can be embedded, to improve the robustness of the watermark.
The philosophy of IFDE: if the data space is defined as the set of all non-zero quantized DCT coefficients, the embedded space is the set of the selected coefficients. If the information of the embedded space represents the hidden information itself, the watermarking system is optimal. In our algorithm, the coefficients are selected from a limited range; modifying their values does not cause much visual degradation, and this characteristic makes it possible to design our mapping function. According to formula 1, if a selected coefficient already matches the watermark bit, it need not be modified. The bit rate may increase if more coefficients are adjusted upward, but the number of selected coefficients is limited, so the number of added bits is small. As the experimental results prove, the bit-rate increase stays below 0.1 percent when testing the standard MPEG-4 video sequences (Claire, akiyo, news, silent).
2.3 The Extracting Process

In contrast with the embedding process, we use the following procedure to extract the watermark bits.
1. In each luma block of an INTER16 MB, the M coefficients before the cut-off point are selected. If there are not as many as M coefficients, the block is not watermarked.
2. The energy of one watermark bit is calculated by adding all the corresponding de-quantized values that represent it, as in formulas (2) and (3).
3. The watermark bit is then obtained according to the sign of the energy sum (formula (4)).
In formula (2), de_quant(x) represents the de-quantized value of coefficient x, and Esign(x) represents the sign assigned to x according to its absolute value; formulas (3) and (4) give the energy value corresponding to the k-th watermark bit and the k-th extracted watermark bit, respectively. As the formulas show, the embedding energy is enlarged to guarantee exact extraction of the watermark bits, because the sign of the energy remains unchanged even though some coefficient values flip outside their original interval during re-encoding. Figure 1 shows an example of the watermarking process (the data in the figure come from the experiments).
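As a concrete illustration of this extraction step, the sketch below accumulates signed de-quantized energy per watermark bit over the four luma blocks and takes the sign of the sum. Since formulas (2)-(4) are not reproduced in the text, the Esign rule used here (positive when the coefficient magnitude is at least the partition point c, negative otherwise) and all parameter values are assumptions, not the authors' exact definitions.

```python
# Minimal sketch of IFDE watermark-bit extraction for one INTER16 MB.
# The sign rule `esign` and the parameters M, C, Q_STEP are illustrative
# assumptions; the paper's formulas (2)-(4) are not shown in the text.

M = 8        # GOW size (watermark bits per block), assumed
C = 2        # partition point of the embedding interval, as in the experiments
Q_STEP = 8   # quantization step used for de-quantization, assumed

def de_quant(coeff):
    """Assumed de-quantization: scale the quantized coefficient back."""
    return coeff * Q_STEP

def esign(coeff):
    """Assumed Esign: +1 if the coefficient magnitude reaches the partition
    point c (encodes bit +1), -1 otherwise (encodes bit -1)."""
    return 1 if abs(coeff) >= C else -1

def extract_gow(selected_coeffs):
    """selected_coeffs: list of per-luma-block lists, each holding the M
    quantized coefficients taken before the cut-off point in zigzag order.
    Returns the M extracted watermark bits (+1 / -1)."""
    energy = [0.0] * M
    for block in selected_coeffs:
        for k, coeff in enumerate(block[:M]):
            # Accumulate signed de-quantized energy for the k-th watermark bit.
            energy[k] += esign(coeff) * abs(de_quant(coeff))
    return [1 if e >= 0 else -1 for e in energy]
```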
3 The Robustness to Re-encoding

As the experimental results show, when the video is re-encoded with the same quantization step, some additional INTER16 MBs appear or some embedded INTER16 MBs are lost (changed into Skipped MBs) because of re-quantization and motion drift. When this happens, the synchronization between embedding and retrieving is destroyed. We improve the robustness of the watermark with a re-embedding strategy and an optimal retrieving strategy that together guarantee the synchronization between embedding and retrieving.
3.1 Re-embedding Strategy

As described in the embedding procedure, we embed the same GOW in every luma block of one selected MB. For P consecutive selected MBs, we also embed the same GOW, and these P MBs constitute a Group of Embedded MBs (GEMB). When one selected block or MB is destroyed, the GOW can still be retrieved from the other blocks of the GEMB.
3.2 Optimal Retrieving Strategy

In the decoder, the rule for retrieving the watermark is as follows: if the currently retrieved GOW and the latest dependable GOW are similar enough (the Similar Rate is large enough, as in
formula 5), the current MB belongs to the current GEMB; otherwise, the following Skipped Number (Max_SN) GOWs are extracted and compared with each other. The two most similar of these Max_SN GOWs are found, and the corresponding MBs belong to the next GEMB.
Fig. 1. The demonstration of IFDE watermarking algorithm
In formula (5), Dissimilar_num denotes the number of differing bits in the two GOWs and len(wm) denotes the length of one GOW (len(wm) = M). In our experiments, we set T to 0.75. If the two GOWs are similar enough, they belong to the same GEMB and the dependable GOW is updated (formula 6), where Dep_GOW denotes the dependable GOW (initialized with the watermark bits) and Curr_GOW denotes the currently retrieved GOW. From the process described above, SR is the criterion for segmenting GEMBs.
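A small sketch of this synchronization check follows. Since formulas (5) and (6) are not reproduced in the text, the similarity definition SR = 1 - Dissimilar_num / len(wm) and the update rule (simply replacing the dependable GOW with the current one) are assumptions consistent with the description above.

```python
# Sketch of the optimal retrieving strategy's similarity test (assumed forms
# of formulas (5) and (6)).

T = 0.75  # similarity threshold used in the paper's experiments

def similar_rate(gow_a, gow_b):
    """Fraction of positions where the two GOWs agree (assumed form of SR)."""
    dissimilar_num = sum(1 for a, b in zip(gow_a, gow_b) if a != b)
    return 1.0 - dissimilar_num / len(gow_a)

def belongs_to_current_gemb(curr_gow, dep_gow, threshold=T):
    """Decide whether the current MB still belongs to the current GEMB and,
    if so, update the dependable GOW (assumed update rule)."""
    if similar_rate(curr_gow, dep_gow) >= threshold:
        return True, list(curr_gow)   # same GEMB; dependable GOW updated
    return False, dep_gow             # start looking for the next GEMB
```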
4 Experimental Results

The experiments are performed on standard MPEG-4 video bitstreams at a frame rate of 30 fps. The test sequences are in 4:2:0 format with a size of 176×144
pixels. We test the performance of the IFDE algorithm in terms of watermark payload, robustness to re-encoding and frame dropping, and visual quality impact.
4.1 Visual Quality and Watermark Payload

We compare the video quality with and without watermarking in Table 1.
The performance parameters are explained as follows. 1. APSNR is the average PSNR of the Y component, which measures the visual quality of the compressed video. 2. Embedded_bits is the number of watermark bits embedded in the video, which measures the watermark payload. 3. BIR is the Bit Increase Rate, which measures the relative bit-rate increase after watermarking (formula 7).
Here watermarked_BR denotes the bit rate with the watermark and original_BR denotes the bit rate without the watermark; this parameter measures the effect of watermarking on compression efficiency. 4. WBR denotes the Watermark Bit Rate, which measures the watermark payload (formula 8).
Here total_bits is the number of bits of the watermarked video. 5. ABER denotes the Bit Error Rate after re-encoding, which measures the robustness to re-encoding (formula 9).
Here Error_bits denotes the number of erroneously retrieved watermark bits. From Table 1, the degradation caused by watermarking to the coding efficiency of MPEG-4 is almost unnoticeable (the changes in PSNR and BIR are very small compared with encoding without watermarking). BIR stays below 0.1 percent, and WBR reaches a maximum payload of 110 bps.
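The small sketch below computes the three rate metrics for a watermarked clip. Because formulas (7)-(9) are not reproduced here, the exact forms used (relative bit-rate increase for BIR, embedded bits per second of video for WBR, and erroneous bits over embedded bits for ABER) are assumptions consistent with the definitions above, not the authors' formulas.

```python
# Assumed forms of the evaluation metrics (formulas (7)-(9) are not shown
# in the text).

def bit_increase_rate(watermarked_br, original_br):
    """BIR (percent): relative growth of the bit rate caused by embedding."""
    return 100.0 * (watermarked_br - original_br) / original_br

def watermark_bit_rate(embedded_bits, total_bits, watermarked_br):
    """WBR (bits per second): embedded bits divided by the video duration,
    where duration = total_bits / watermarked_br (assumed definition)."""
    duration_s = total_bits / watermarked_br
    return embedded_bits / duration_s

def after_reencoding_bit_error_rate(error_bits, embedded_bits):
    """ABER (percent): share of watermark bits retrieved incorrectly."""
    return 100.0 * error_bits / embedded_bits
```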
Table 2 shows the watermark payload for different parameters. The watermark payload can be large if M is large and P is small, though the degradation of coding efficiency then increases (BIR increases and PSNR decreases); nevertheless, our experimental results show that the algorithm still guarantees high coding efficiency (the largest BIR is 0.1347 percent).
Figure 2 shows the visual quality at different WBR values, and Figure 3 shows the frame-by-frame PSNR curve. From Figure 2, PSNR tends to decrease when WBR is increased, but it remains high overall. Compared with the XDEW algorithm, IFDE has better visual quality and a higher watermark payload.
Fig. 2. The relation between WBR and PSNR (Claire_qcif, 300 frames, 30fps, M=8, P=8)
Fig. 3. Frame by frame PSNR measurement (Claire, 300 f, 30fps, 72bps watermark payload)
Fig. 4. The relation curve between WBR and ABER (Claire_qcif, 300 frames, 30fps)
4.2 The Robustness of Watermarking

As Table 1 shows, the robustness of the watermark to re-encoding is very strong compared with the XDEW algorithm. Figure 4 shows the ABER curve at different WBR values. As the experimental results show, ABER can be kept below 10 percent when the video is re-encoded with the same quantization step.
5 Conclusion

This paper proposes the IFDE algorithm, which embeds the watermark bits in P-frames. Repeated embedding and optimal retrieving are used to improve the performance of the system. The scheme is robust to re-encoding, and in particular a large watermark payload can be achieved in a low bit-rate environment. The artifact introduced by this scheme is imperceptible, as measured by the average PSNR of the P-frames in the watermarked video stream. Additionally, the computational complexity of the scheme is low, since the time complexity of our algorithm is linear. The scheme is general and can be extended easily to other video standards such as MPEG-2 and JVT.
References
1. B.C. Mobasseri, M.J. Sieffert and R.J. Simard: Content authentication and tamper detection in digital video. IEEE ICIP, Vol. 1, (2000) 458–461
2. F. Hartung and B. Girod: Digital watermarking of MPEG-2 coded video in the bitstream domain. ICASSP-97, Vol. 4, (1997) 2621–2624
3. I.J. Cox, J. Killian, T. Leighton and T. Shamoon: Secure spread spectrum watermarking for multimedia. IEEE Trans. on Image Processing, Vol. 6(12), (1997) 1673–1687
4. Xiaoyun Wu, Wenwu Zhu, Zixiang Xiong, Ya-Qin Zhang: Object-based multiresolution watermarking of images and video. IEEE ISCS 2000, Vol. 1, (2000) 212–215
5. D. Cross, B.G. Mobasseri: Watermarking for self-authentication of compressed video. IEEE ICIP, Vol. 2, (2002) 913–916
6. Yuanjun Dai, Lihe Zhang and Yixian Yang: A new method of MPEG video watermarking technology. IEEE ICCT, (2003) 1845–1847
7. I. Setyawan and R.L. Lagendijk: Low bit rate video watermarking using temporally extended Differential Energy Watermarking (XDEW) algorithm. Proc. Security and Watermarking of Multimedia Contents III, Vol. 4314, (2001) 73–84
Improving DTW for Online Handwritten Signature Verification

M. Wirotius 1,2, J.Y. Ramel 1, and N. Vincent 3

1 Laboratoire d'Informatique, 64, avenue Jean Portalis, 37200 Tours, FRANCE
2 AtosOrigin, 19, rue de la Vallée Maillard, BP 1311, 41013 Blois Cedex, FRANCE
3 Laboratoire SIP, 45, rue des Saints-Pères, 75270 Paris Cedex 06, FRANCE
Abstract. Authentication by handwritten on-line signature is one of the most widely accepted authentication systems. In a way, it is based on biometrics; it is embedded in our cultural habits and easy to use. The aim of this paper is to study two possible improvements to Dynamic Time Warping: the matching process and the distance computation. After applying a polygonal approximation to the signatures, we test different approaches on the authentication problem. Usually, the information used to match on-line signatures is the coordinates or the speed at the input data points. First, as far as the matching is concerned, we investigate other possibilities relying on the local information at each point. Next, we also test several methods that take local information into account to compute the distance. To limit rejections of genuine signatures, we use the result of the matching as a way to detect forgeries and we also modify the computation of the distance. Finally, we evaluate the different approaches on a base of 800 signatures. The results obtained show an improvement over the classical use of DTW.
1 Introduction
The use of a handwritten signature for authentication relies on the assumption that instinctive movements are more involved in the act of signing than conscious acts. This means that some characteristics of the signature are stable and constant for a given signer, so the on-line signature can be considered a behavioral biometric method. The term authentication covers in fact two sub-problems: identification and verification. Signature identification has to determine, from a signature database, the person to whom a signature is closest. Verification is a quite different notion: it does not rely on a database; instead, it checks whether the tested signature was produced by the person the signer claims to be. Here we are more concerned with a verification problem than with identification. Besides, the evaluation of a system cannot be achieved using only the classical error rate, because all errors do not have the same impact and the constraints of the application may differ. We must differentiate the false acceptance rate (FAR), indicating forgeries that are not pointed out by the system, and the false rejection rate (FRR), indicating the genuine signatures that are rejected. The values of these parameters have to be tuned according to the application. For an industrial product, the value of FRR must be low in order to avoid repetition of the authentication phase when the authorized person has been denied and must have a new trial. Here, our objective is to
keep the FRR value lower than 2%, since users of authentication systems tolerate no higher value. Perhaps the most common approach to this problem of on-line signature verification is the use of string comparison. We have chosen Dynamic Time Warping (DTW), and the aim of this paper is to compare different ways to apply it: which sequence is to be considered and how to compute a similarity measurement. After the principle of DTW is presented, we will thus recall various matching approaches and explain their respective interests. The second part of the paper presents some methods to compute the distance between the signatures according to the best matching previously found using DTW. In order not to reject genuine signatures, we have modified the computation of the distance and we have introduced a new step between the matching and the computation of the distance, which consists in evaluating the quality of the matching. Finally, after presenting our test database, we will compare the results obtained with the different combinations of matchings and distances.
2 Dynamic Time Warping
One of the most important difficulties in authentication using handwritten signatures is the choice of the comparison method. On-line signatures are given as a sequence of points sorted with respect to acquisition time. Since two signatures of the same person cannot be completely identical, we must use a measure that takes this variability into account. Indeed, two signatures cannot have exactly the same timing, and these timing differences are not linear. Dynamic Time Warping is an interesting tool because it realises a point-wise correspondence and is not sensitive to small variations in the timing. Dynamic Time Warping is an application of the techniques of dynamic programming developed by Bellman in the fifties [1]. It is particularly used in the domain of speech recognition. This method finds, for each element of one sequence, the best corresponding element in the other one according to some metric [4]. Here the most intuitive metric is the usual spatial Euclidean distance between the signature points. Once this correspondence is found, we compute the distance between the two sequences (curves) by adding the distances between corresponding points. We have chosen to normalise the distance by dividing the sum by the number of correspondences. The transformations allowed in the correspondence are stretching or compression along the temporal axis of a signal. The aim of these local adjustments is to minimize the difference between the two signals. Furthermore, we distinguish two approaches: asymmetrical and symmetrical. In the asymmetrical case, we seek to establish a correspondence between a sample and a pattern whereas, in the symmetrical case, the correspondence is sought both sample-to-pattern and pattern-to-sample. This second criterion seems to provide better results [3], so this is the method we selected. This curve-comparison method is the most widely used in the field of authentication by on-line handwritten signature. Calculating distances between signatures with DTW gives us a verification system that is more flexible, more efficient and more adaptive than systems based on computed features processed by neural networks or Hidden Markov Models, as the training phase can be
incremental. This aspect is very important when we think of elaborating an authentication method that takes into account the evolution of the signature over a long period.
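As an illustration of the DTW principle used here, the following sketch matches two point sequences with the usual dynamic-programming recurrence and normalises the accumulated Euclidean distance by the number of correspondences on the warping path. The variable names and the simple backtracking are assumptions for illustration, not the authors' code.

```python
import math

def dtw(seq_a, seq_b):
    """DTW between two sequences of (x, y) points.
    Returns (normalised distance, list of index correspondences)."""
    n, m = len(seq_a), len(seq_b)
    dist = lambda p, q: math.hypot(p[0] - q[0], p[1] - q[1])

    # Accumulated-cost matrix with the classical three-way recurrence.
    INF = float("inf")
    acc = [[INF] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(seq_a[i - 1], seq_b[j - 1])
            acc[i][j] = d + min(acc[i - 1][j],      # compression
                                acc[i][j - 1],      # stretching
                                acc[i - 1][j - 1])  # match

    # Backtrack the warping path to obtain the point-wise correspondences.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min((acc[i - 1][j - 1], i - 1, j - 1),
                      (acc[i - 1][j], i - 1, j),
                      (acc[i][j - 1], i, j - 1))
    path.reverse()

    # Normalise by the number of correspondences, as described above.
    total = sum(dist(seq_a[a], seq_b[b]) for a, b in path)
    return total / len(path), path
```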
3 Matching
Since we consider the coordinates of the signature points, it is necessary to perform a preprocessing step to normalise the signatures before comparing them, because a distance measurement is not invariant to transformations, such as rotation or translation, applied with different parameters to each of the elements we are comparing. The normalisation is done in three stages. First of all, a rotation is performed by aligning the principal inertia axis with the horizontal axis. Then, a scale change is carried out so that all signatures are contained in a surrounding rectangle with the same width. The last stage is a translation of the centre of gravity of the set of points to the centre of the coordinate system.
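A minimal sketch of this three-stage normalisation is given below (principal-axis rotation via the covariance matrix, scaling to a common bounding-box width, and centring on the centre of gravity). The fixed target width and the use of numpy are assumptions for illustration.

```python
import numpy as np

TARGET_WIDTH = 1.0  # assumed common bounding-box width after scaling

def normalise_signature(points):
    """points: (N, 2) array of on-line signature coordinates.
    Returns the rotated, scaled and centred point set."""
    pts = np.asarray(points, dtype=float)

    # 1) Rotation: align the principal inertia axis with the horizontal axis.
    centred = pts - pts.mean(axis=0)
    cov = np.cov(centred.T)
    eigvals, eigvecs = np.linalg.eigh(cov)
    principal = eigvecs[:, np.argmax(eigvals)]
    angle = np.arctan2(principal[1], principal[0])
    c, s = np.cos(-angle), np.sin(-angle)
    rotated = centred @ np.array([[c, -s], [s, c]]).T

    # 2) Scaling: make every signature fit a surrounding rectangle of equal width.
    width = rotated[:, 0].max() - rotated[:, 0].min()
    scaled = rotated * (TARGET_WIDTH / width) if width > 0 else rotated

    # 3) Translation: move the centre of gravity to the origin.
    return scaled - scaled.mean(axis=0)
```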
3.1 Input Data
As a previous experiment on DTW showed that using all the signature points does not give the best results, we have chosen to select more representative and stable points, namely the vertices of a polygonal approximation computed using the method proposed by Wall [2], [5]. This approximation method introduces a new vertex each time the error made by replacing the curve by a straight segment becomes too important. The error is based on the calculation of an area and on a threshold. This threshold represents the cumulated error authorized per unit of length, the error being the area between the curve and the approximating segment at each step of the trials. The condition for starting a new segment is Error > Lg × Epsilon, where Lg denotes the segment length and Epsilon is a constant. Thus we determine the vertices of the polygonal line. This is an iterative process and consequently it presents two advantages: on the one hand, it is not necessary to store all the points of the curve; on the other hand, the points of the polygonal approximation are extracted according to their order of appearance along the curve. Nevertheless, this algorithm requires choosing one parameter linked to the quality of the desired approximation. We have retained two methods to choose the value of this parameter: first, a fixed threshold, valid for all the signatures independently of the signer, and second, a threshold that is optimised automatically according to the author of the signature. As the authentication results are better with an individualized threshold, this is the solution we use. From these sequences of vertices, we define the vectors associated with the segments between two consecutive selected points. As we want to find the best matching between two signatures, it is natural to compare the shapes of the vectors. So, from these vectors, we extract three characteristics: the length, the duration and the absolute angle. The length and the duration of a vector are normalised respectively by the length and the duration of the whole signature. The absolute angle refers to the horizontal direction. All this information can be used independently for the matching, and we now present some comparisons.
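To make the approximation concrete, here is a sketch of a Wall-Danielsson-style sequential polygonalisation driven by the Error > Lg × Epsilon test. The area-based error accumulation shown is a common formulation of this method; together with the parameter value, it is an assumption rather than the authors' exact implementation.

```python
def polygonal_approximation(points, epsilon=2.0):
    """Sequential polygonal approximation of an on-line curve.
    points: list of (x, y) in acquisition order; epsilon: allowed cumulated
    error per unit of segment length (the value here is only illustrative).
    Returns the indices of the retained vertices."""
    vertices = [0]
    start = 0
    area = 0.0
    for i in range(1, len(points)):
        x0, y0 = points[start]
        x1, y1 = points[i - 1]
        x2, y2 = points[i]
        # Accumulate the signed area of the triangle (start, previous point,
        # new point); the running sum is the area between the chord and the curve.
        area += ((x1 - x0) * (y2 - y0) - (x2 - x0) * (y1 - y0)) / 2.0
        seg_len = ((x2 - x0) ** 2 + (y2 - y0) ** 2) ** 0.5
        if abs(area) > seg_len * epsilon:
            # Error exceeds Lg * Epsilon: close the segment, start a new one.
            vertices.append(i - 1)
            start = i - 1
            area = 0.0
    vertices.append(len(points) - 1)
    return vertices
```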
3.2 Coordinate Approach
We can use the coordinates of the vertices obtained by the polygonal approximation to perform the matching. The matching quality corresponds to the number of errors between the matching realised automatically and the matching that would be realised manually. Figure 1 shows the matching between two genuine signatures and between a genuine signature and a forgery. The result of the matching with the genuine signature is rather good because it is very close to the matching we expect. For the forgery, the result of the matching is very different from the previous one, and many vertices of the genuine signature have more than one matching vertex in the forgery; it seems that this could also define a criterion for forgery detection.
Fig. 1. Illustration of the matching using the coordinates of the vertices in (a) for a genuine signature and in (b) for a forgery.
3.3 Length, Duration, or Angle Approaches
The other parameters we have extracted from the polygonal approximation could be used for the same purpose. As can be observed in Figure 2, the result of the matching according to the length of the vectors between two genuine signatures is not the one we could have expected. Indeed, the result of this matching is very different from the human one. The length information alone is not sufficient to represent a vector: by representing a vector only by its length, we lose too much information.
Fig. 2. Illustrations of the matching using the vector length in (a) for a genuine signature and in (b) for a forgery.
The second approach consists in using the duration of the vector as the only available information. The result of this approach is quite similar to the previous one. There is not enough information to perform the matching. As previously, there are errors in the
matching due to the fact that, considering only one criterion like the duration or the length of the vectors, many vectors happen to have several corresponding vectors. In the third approach, we use the absolute angle of the vector as the only available information. Here the result is better than in the previous cases because this criterion shows much more variation. As it still presents errors with respect to the expected matching, we decide to carry out a matching based on the coordinates of the vertices. In fact, using the coordinates of these points is quite similar to fusing all the different data used in each of the approaches. For these reasons, we use the coordinate matching in the following tests.
4 Similarity Measurement
Usually, the final distance between the signatures is the sum of the Euclidean distances between the corresponding points of the two signatures. In order to improve the significance of the distance between signatures, we decided to test types of information different from the classical approaches, which only use spatial point comparison.
4.1 Vector Approach
As the coordinates are used for the matching, we explore new ways to compute the distance between signatures by adding local complementary information instead of using the coordinates again. Here, three types of information are considered - length, duration and angle - to compare vector lists. First, the local information used is the ratio between the length of the previous vector and that of the next one, and we define the distance between two vectors as the absolute difference between the two ratios. Second, the local information used is the ratio between the duration of the previous and the next vectors; the distance is computed as previously. Third, the local information added is the angle between the previous and next vectors, and the distance between two vectors is the difference between the two angle measures. All the angles are absolute, ranging between 0 and 360°.
4.2 Coordinate Approach
Finally, we apply the classical approach in order to allow some evaluation of the previous approaches. As our objective is to minimize the rejection of genuine signatures, we propose to modify the computation of the distance between signatures. Usually, in case of multiple matching, all the distances between one point and all its corresponding points are added. Here, we only compute the distance to the first corresponding point in order to avoid accumulating errors. As this approach could also reduce the distance between genuine and forged signatures, before computing the distance between the signatures we perform a global estimation of the quality of the matching. For that purpose, we count the number of times one point in a signature has more than one corresponding point in the other signature. This defines a
new verification criterion: if the number of multiple matchings is too large, this indicates that the tested signature is a forgery.
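The two modifications (counting multiple correspondences as a forgery indicator, and summing only the first correspondence per point) can be sketched as follows, operating on a DTW warping path such as the one returned by the earlier dtw sketch. The helper names are assumptions, not the authors' implementation.

```python
import math

def count_multiple_matches(path):
    """Number of points of the first signature that are matched to more than
    one point of the second signature along the DTW path."""
    counts = {}
    for i, _ in path:
        counts[i] = counts.get(i, 0) + 1
    return sum(1 for c in counts.values() if c > 1)

def first_correspondence_distance(seq_a, seq_b, path):
    """Modified distance: for each point of seq_a, keep only its first
    correspondence on the path, then average the Euclidean distances."""
    seen, dists = set(), []
    for i, j in path:
        if i in seen:
            continue            # skip multiple matches to avoid cumulating errors
        seen.add(i)
        dists.append(math.hypot(seq_a[i][0] - seq_b[j][0],
                                seq_a[i][1] - seq_b[j][1]))
    return sum(dists) / len(dists)
```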
5 Evaluation Method
To evaluate the different matching methods and the computation of distances between signatures, we made use of a base of 800 signatures produced by 40 writers. Among the 20 genuine signatures of each signatory, five are used to build the reference patterns of the signature and the others are used for the tests. Each signer is thus represented by 5 patterns: one for each training signature.
5.1 Authentication Strategy
In order not to apply time-consuming processing, such as the computation of DTW, when it is obvious that the signatures to be compared are totally different, we implement a "coarse to fine" approach. First, we detect the forgeries that are very far from the training patterns by using fast and simple methods. After that first step, by use of more elaborate methods, we consider forgeries closer to the pattern model. The principal constraint of the first stage is not to reject genuine signatures. As the characteristics used must be relatively stable, we chose global characteristics: the length and duration of the signature. Let Lt and Dt denote respectively the length and the duration of the tested signature, and Li and Di respectively the length and the duration of the i-th training signature in a set of n models. The decision rule can be expressed as follows: if the tested length Lt deviates too much from the training lengths Li, then the signature is considered a forgery; otherwise we carry on with the second stage. We apply the same principle with the duration of the signature. This first stage allows us to detect 58% of forgeries and to accept 99.8% of genuine signatures.
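The formula of this first-stage decision rule is not reproduced in the text. The sketch below therefore illustrates only one plausible form of the coarse pre-filter (reject when the tested length or duration deviates from every training signature by more than a relative tolerance); both the rule's shape and the tolerance value are assumptions.

```python
# Assumed form of the coarse first-stage test on global characteristics.

REL_TOL = 0.3  # illustrative relative tolerance, not the paper's value

def coarse_reject(tested_value, training_values, rel_tol=REL_TOL):
    """tested_value: Lt (or Dt); training_values: the Li (or Di) of the models.
    Returns True when the signature should be rejected as a forgery."""
    return all(abs(tested_value - v) > rel_tol * v for v in training_values)

def first_stage(lt, dt, lengths, durations):
    """Apply the same principle to length and duration, as described above."""
    if coarse_reject(lt, lengths) or coarse_reject(dt, durations):
        return "forgery"
    return "continue to second stage"
```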
5.2 Comparison by DTW
The aim of the next stage is to detect the forgeries that were not detected during the previous stages. Let St be the tested signature and Si a training signature. Before computing the distance between the signatures, we compare the matching obtained between training signatures with the matching performed between St and Si. Let Matching(S1,S2) be the number of points that have more than one corresponding point in the matching of the two signatures S1 and S2. The decision rule is: if Matching(St,Si) remains comparable to the values observed between training signatures, then the signature is considered possibly genuine, else the signature is rejected. This step allows us to detect 75% of forgeries and to accept 99.5% of genuine signatures. After performing this test, we compute the distance between the two signatures St and Si. The decision rule is: if the distance is small enough with respect to a threshold controlled by the alpha parameter, then the signature is considered genuine, else
the signature is rejected. We let the alpha parameter vary in order to define authentication systems that are more or less tolerant. The alpha value is independent of the signer. The quality of these systems can be represented in a two-dimensional space indicating the FAR and FRR values.
6 Results
The tests realised with the vector (angle, length, duration) approach for computing the distance between the signatures do not give good results. The best result is obtained by considering the ratio of durations between successive vectors: we obtain an EER of 20% and a FAR of 37.5% for an FRR equal to 2%. This result underlines the importance of considering the coordinates of the points in the computation of the distance, because a large number of vectors are close to each other with respect to these characteristics. To evaluate the methods introduced to reduce the number of rejected genuine signatures, the results obtained with the modified DTW are compared with those obtained with the classical methods.
Fig. 3. FAR vs FRR
The results show the importance of taking into account the quality of the matching before computing the distance and the impact of the modification of the computation of the distance. The distance between two signatures is not sufficient to detect a forgery. In fact, there are some cases where the distance is low whereas the matching
seems to underline the difference between the signatures. The modification of the computation of the distance makes it possible to reduce the differences between two genuine signatures. In fact, if a point has several corresponding points, it means that the associated vector is decomposed into several vectors, so the classical distance between curves induces an accumulation of errors. The last aspect shown in Table 1 is that these modifications are complementary, since the best result is obtained by their combination. Thus we obtain a reduction of 9% in EER and a reduction of FAR of 33.3% for an FRR equal to 2%.
7 Conclusion
Through experiments with several approaches, we have shown that it is possible to improve the matching deduced from DTW; we observe that the best matching is obtained by using the coordinates of the points after the polygonal approximation. However, the matching could still be improved when a succession of small vectors occurs: if the coordinates of the points are close, errors can appear, so taking local information into account should help avoid matching errors. When we consider the matching quality before computing the distance, it is possible to detect forgeries that are close to the genuine signature with respect to the distance. The measurement of the matching quality could be made more precise, and other measures could be used. For the computation of the distance, taking into account the difference of length, duration or angle between two vectors instead of the coordinates of the points does not improve the authentication. Modifying the computation of the distance between two signatures makes it possible to reduce the distance between genuine signatures by not taking into account small differences in the polygonal approximation. Even if the use of the coordinates of the points gives the best results, we think their combination with local information should improve the results. So, one of our principal prospects is the fusion of local information with coordinates for computing the distance between signatures.
References
1. R. Bellman, "Dynamic Programming," Princeton Univ. Press, 1957.
2. G. Dimauro, S. Impedovo, R. Modugno, G. Pirlo, L. Sarcinella, "Analysis of Stability in Hand-Written Dynamic Signatures," Proceedings of the 8th Int. Workshop on Frontiers in Handwriting Recognition (IWFHR'02), Ontario, Canada, pp. 259-263.
3. T. Hastie, E. Kishon, M. Clark and J. Fan, "A Model for Signature Verification," Technical report, AT&T Bell Laboratories, 1992.
4. R. Plamondon and M. Parizeau, "Signature verification from position, velocity and acceleration signals: A comparative study," Proceedings of the 9th Int. Conf. on Pattern Recognition (ICPR'88), vol. I, Rome, Italy, pp. 260-265, 1988.
5. K. Wall and P. Danielsson, "A fast sequential method for polygonal approximation of digitised curves," Computer Vision, Graphics and Image Processing, 1984, vol. 28, pp. 220-227.
Distribution of Watermark According to Image Complexity for Higher Stability

Mansour Jamzad 1 and Farzin Yaghmaee 2

1 Dept. of Computer Engineering, Sharif University of Technology, Tehran, Iran
  [email protected]
2 Dept. of Electrical Engineering, Semnan University, Semnan, Iran; The Center for Theoretical Physics and Mathematics, Computer Science Research Center, Niyavaran, Tehran, Iran
  [email protected]
Abstract. One of the main objectives of all watermarking algorithms is to provide a secure method for detecting all or part of the watermark pattern in case of the usual attacks on a watermarked image. In this paper we introduce a method, suitable for any spatial-domain watermarking algorithm, that provides a measure of the level of robustness when a given watermark is to be embedded in a known host image. In order to increase the robustness of the watermarked image, the watermark bit pattern is embedded several times. To do this, the entire image is divided into 16 equal-size blocks. For each block the complexity of the sub-image in that block is measured. The number of repetitions of watermark bits saved in each block is determined according to the complexity level of that block. The complexity of a sub-image is measured using its quad-tree representation. This approach not only secures the watermarked image with respect to the usual attacks, but also enables us to save longer bit patterns of the watermark while maintaining a good level of similarity between the original image and the watermarked one. For evaluating the performance of our method we tested it on 1000 images having low, medium and high levels of complexity, and compared the results with those obtained on the same set of images without considering the complexity of the sub-images in blocks. Our new method provided 17% higher stability. Keywords: Watermark, bit distribution, quad-tree, image complexity.
1 Introduction
Watermarking digital images can be used to ensure the legitimacy of the sending side and also to prove ownership of digital images. When a digital image is watermarked, it means that a pattern is hidden in the original image in such a way that the watermarked image looks identical to the original one when seen by any person. However, the owner of the image can prove the existence of the watermark pattern by analyzing the watermarked image using a detection program. One of the major difficulties of watermarking algorithms is that they
should be able to resist common attacks on images such as noise addition, filtering, compression, rotation, scaling, etc. Watermarking algorithms can be categorized into two groups. The first group embeds the watermark pattern in the pixel values, in the spatial domain [2], [5]. The second group works in transform domains such as Fourier, DCT, or wavelet [4]. In this paper, for watermark embedding and detection we use our previous work, which is fully described in [3]. In short, [3] divides the entire host image into 16 blocks. A fixed number of bits from a repetition of the watermark bits is saved in each block, and the pixels of each block that embed a watermark bit are selected randomly. The detection algorithm is based on a Naive Bayes classifier; that is, instead of using a fixed, pre-defined threshold for watermark detection, it uses the results obtained in a training and learning phase. In order to obtain higher stability while maintaining the image quality, we introduce the idea of distributing the number of repetitions of watermark bits based on the complexity of the sub-images in the blocks. We use the quad-tree representation of a sub-image to calculate its complexity.
2 Our Algorithm
One of the difficulties of most watermarking algorithms is that they do not take into consideration the level of complexity (i.e., detail) of the host image when determining an optimal number of bits to embed in it. In fact, if we save a constant number of bits in each block, then as the number of bits increases, the degradation of the original image in blocks with low complexity becomes evident. This clearly limits the number of bits that can be embedded in the original image, which in turn affects the watermark stability. We solve this problem by first computing the image complexity in each block, and then relating the number of bits saved in each block to its complexity measure.
2.1 Determining the Image Complexity
The image complexity can be computed according to spatial or frequency features, and there are several methods for this computation [6], [7]. In this paper, since we use the spatial-domain statistical distribution of the pixels for embedding the watermark bits, we found that a measure obtained from the quad-tree representation of an image is the most suitable for our purpose. Originally, the quad-tree representation was introduced for binary images; for gray-scale images, since the gray-level variance is a good measure of contrast, the variance is calculated in each block of the image. In addition to the variance, a measure of similarity between the gray levels of neighboring pixels of a block is calculated. If the variance is smaller than a certain threshold (meaning that there is not much detail in that block) and there is high similarity among the pixels of the block, then that block is not divided any further; otherwise it is divided into
four blocks. The division of a block into four blocks continues until either a block no longer needs to be divided or the block size reaches one pixel (the original image size is assumed to be a power of 2). Figure 1 shows the quad-tree representation of two gray-scale images. According to our measure, both images have similar complexity, since the numbers of nodes in their quad trees (i.e., 29 and 21) are close to each other. Figure 1-b shows a nearly balanced tree with low depth, whereas Figure 1-a shows a tree with high depth. For color images, the quad tree is constructed on the gray-scale representation. For calculating the complexity, we use the sum, over all levels, of the number of nodes in each level multiplied by 2 to the power of that level number. More explanation of the reason for this computation is given in Section 4.
Fig. 1. A typical quad-tree representation of two images. (A) an unbalanced quad tree with high depth, with 29 nodes; (B) a balanced quad tree with low depth, with 21 nodes
2.2
Distributing Watermark on Image Blocks by Their Complexity
We assume our original images are of size 512x512. Initially an image is divided into 16 equal-size blocks (i.e. sub-images). Let C_i be the image complexity calculated for block i, i = 1, 2, ..., 16, and let C_min and C_max be their minimum and maximum. Now we define a complexity division factor d as follows:
d = (C_max - C_min) / K,
where K is a scale factor for d. It means that we are dividing the levels of complexity into K different classes; that is, the complexity of a block is quantized into K levels. The ranges [C_min, C_min + d), [C_min + d, C_min + 2d), ..., [C_max - d, C_max] are assigned to classes 1, 2, ..., K, respectively. These quantized levels of complexity are called Q_i, for i = 1, 2, ..., 16.
Now assume we want to change N bits of the original image to embed M bits of the watermark pattern, where N = rM and r is an integer (the number of repetitions). Then N_i, the number of bits that can be modified in block i, is determined according to the following equation:
N_i = N * Q_i / (Q_1 + Q_2 + ... + Q_16).
In this way, by choosing an appropriate value for r, the number of repetitions with which the watermark is embedded in an image, we can distribute the total bits over all blocks such that the number of bits saved in each block is directly related to its class of complexity.
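A sketch of the quantisation and bit-allocation step described above is given below; the rounding of the per-block budgets is an implementation detail not specified in the paper and is an assumption here.

```python
import numpy as np

def quantize_complexity(c, K=4):
    """Map raw block complexities c[0..15] to classes 1..K using the division factor d."""
    c = np.asarray(c, dtype=float)
    d = (c.max() - c.min()) / K                        # complexity division factor
    q = np.floor((c - c.min()) / max(d, 1e-12)).astype(int) + 1
    return np.clip(q, 1, K)

def allocate_bits(q, N):
    """Distribute the N modifiable bits over the 16 blocks in proportion to their class."""
    q = np.asarray(q, dtype=float)
    n_i = np.floor(N * q / q.sum()).astype(int)
    n_i[0] += N - n_i.sum()                            # keep the total exactly N
    return n_i
```

For example, allocate_bits(quantize_complexity(c, K=4), N=40000) returns the per-block budgets used by the K4-algorithm of Section 3.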
2.3
Determining the Complexity of Watermarked Image
Because many bits of the original image are modified in its watermarked version, the complexity of the original image and that of the watermarked one are no longer the same. This means that we cannot carry out the quad-tree calculation on the watermarked image to determine the complexity values of its blocks. Another reason is that the watermarked image might have been attacked and modified by geometrical transformation, compression, etc., which causes further dissimilarity. In order to recover the complexity values of the watermarked image, we save the complexity values obtained from the original image as part of the key inside the watermarked image. Since we have assumed K = 4, we need 2 bits to represent the complexity number of each block, and the complexity numbers of all 16 blocks can be represented with 32 bits. Therefore, the key in our algorithm has two parts: the first part is the seed of the random number generator, and the second part is the bits corresponding to the complexity numbers of all 16 blocks. Since we have the key, the complexity values are therefore available at detection time.
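A possible encoding of such a two-part key is sketched below; the 32-bit seed width is an assumption made only for illustration.

```python
def build_key(seed, classes, K=4):
    """Concatenate the PRNG seed with the class code of each of the 16 blocks (2 bits when K = 4)."""
    bits_per_block = (K - 1).bit_length()
    class_bits = ''.join(format(q - 1, f'0{bits_per_block}b') for q in classes)
    return format(seed, '032b') + class_bits           # illustrative 32-bit seed field
```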
Fig. 2. (a) Performance evaluation of K1 and K4-algorithms for N = 40000 bits. (b) The same but for N = 80000. The Y axis shows the percentage of stability
3
Stability Comparison
In this section we compare the stability of our new method with the case where a fixed number of bits is saved in each block. In both algorithms we trained on 1000 images using the Naive-Bayes classifier and used N = 40,000 bits for the watermark representation. For our new algorithm, the number of complexity classes was set to K = 4. Note that setting K = 1 reduces our new algorithm to the algorithm that assumes only one level of complexity for all blocks. For simplicity of discussion, in the following we name these two algorithms the K1-algorithm (fixed capacity) and the K4-algorithm (variable capacity), respectively. To compare the capacity of images for embedding watermark bits, we watermarked 2000 images that were different from those used in the training stage. Watermarked images from both the K1- and K4-algorithms were randomly selected, and attacks such as noise addition with different percentages, blurring, JPEG compression, the chessboard effect and histogram equalization were applied to them. Figure 2 shows the result of this comparison for N = 40000 and N = 80000 pixels. As can be seen, in the case of N = 40000 the K4-algorithm gives a 20% increase in capacity, but for N = 80000 only a 4% increase is achieved. The poor performance in the latter case is mainly due to the fact that, because of the large number of watermark bits, all blocks, independent of their complexity levels, have gone through the maximum possible changes (i.e. the host image is fully saturated by repetitions of the watermark bits). In addition, as seen in figure 2, the stability of the watermark reaches 78% for N = 40000 with the K4-algorithm, which is very close to the 82% obtained for N = 80000 with the K1-algorithm. It means that using the K4-algorithm on an image of size 512x512, by changing 15% of the pixels (i.e. 40000) instead of 30% (i.e. 80000 pixels), we can reach practically the same stability expected from the K1-algorithm.
3.1
How the Number of Classes of Complexity Can Affect the Stability
The stability of watermarked images was examined by saving N = 40000 bits of watermark in each of the 2000 images described in section 3, with different numbers of classes of complexity. Figure 3 shows the stability for K = 2, 4, 8, 16 and 32 classes. As seen in this figure, the stability increases for K = 2, 4 and 8, but further increases in the number of classes of complexity, K = 16 and 32, reduce the stability. The reason is that, by increasing K, the number of watermark bits assigned to blocks with low complexity is reduced in an unpredictable way; in other words, the distribution is concentrated in blocks with higher levels of complexity. This means that in some blocks only a few repetitions of the watermark bits are embedded, which may cause incorrect extraction of the watermark. This is the main reason for the decrease in stability. Our experimental results showed that defining K = 4 or K = 8 classes of complexity gives the highest stability. However, for simplicity of calculation we selected K = 4.
Fig. 3. The relation between stability and the number of classes of complexity
4
How the Shape of Quad-Tree Is Related to Stability
A good measure of complexity should be based on the overall structure of the quad tree. Therefore, the two most important factors in this regard are the number of nodes and the depth of the quad tree. Having the quad-tree representation of an image, we examined the following four measures of complexity:
1. The number of nodes in the quad tree.
2. A long decimal number whose digits, from the lowest level to the root, are the numbers of nodes in each level.
3. The sum over levels of the number of nodes in each level multiplied by that level number.
4. The sum over levels of the number of nodes in each level multiplied by 2 to the power of that level number.
Figure 4 shows the stability of the watermarked image using each of the above four measures. As seen in this figure, measure 4 gives the best stability. Therefore, we selected it to measure the image complexity.
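Assuming the number of quad-tree nodes at each depth has already been collected, the four candidate measures can be computed as in the following sketch; measure 2 is shown only for single-digit node counts, and the exact digit ordering is an interpretation of the description above.

```python
def complexity_measures(nodes_per_level):
    """nodes_per_level[l] = number of quad-tree nodes at depth l (root is level 0)."""
    total = sum(nodes_per_level)                                        # measure 1
    digits = int(''.join(str(n) for n in reversed(nodes_per_level)))    # measure 2 (lowest level first)
    level_weighted = sum(l * n for l, n in enumerate(nodes_per_level))  # measure 3
    power_weighted = sum(n * 2 ** l for l, n in enumerate(nodes_per_level))  # measure 4 (selected)
    return total, digits, level_weighted, power_weighted
```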
Fig. 4. Comparison of four methods of complexity measurement and the number of classes of complexity with watermark stability
The reason for this behavior is the overall shape of the quad tree, not only its number of nodes. Figure 1-a has an unbalanced quad tree with high depth, whereas figure 1-b has a well-balanced quad tree with low depth. The lower stability for figure 1-a arises because more nodes are assigned to only a few blocks of the image and, as a result, most of the watermark bits have to be saved in those few blocks. For figure 1-b the nodes are evenly assigned to the blocks of the image (i.e. a more balanced tree) and, as a result, the watermark bits are evenly distributed over the entire image. It is known that clipping is one of the most serious attacks on watermarked images. If the watermark bits are concentrated in a few blocks, and it happens that exactly those blocks are clipped away, we lose most of our information and the watermark pattern may not be retrieved. The main reason for the lower performance of images with an unbalanced quad tree is thus their weakness with respect to clipping.
5
The Relation Between Stability and the Total Number of Pixel Modifications
To see how the total number of pixel modifications in the watermarked image affects its stability, we tested 200 images in which 5% to 45% of the image pixels were modified to embed the watermark bits. The stability was calculated for each case with K = 4 classes of complexity (we note that K = 4 gave the best stability, as described in 3.1). Figure 5 shows that the stability increases with the percentage of pixels modified, but it reaches a stable state after 25%. The reason for this behavior is that the sub-image blocks in the host image become saturated with watermark bits. It means that further modification not only cannot increase the stability but will also decrease the watermarked-image quality.
Fig. 5. The relation between the percentage of total number of pixels modified in host image and the stability.
6
Conclusion
There are a great number of watermarking methods that work in the spatial and frequency domains. All these methods try to provide a secure way to embed a watermark in an image such that it is robust with respect to the usual attacks on digital images. However, the performance of these algorithms differs with respect to the type of attack and the amount of robustness they provide against it. In this paper we showed that, regardless of the watermarking method used, we can increase its robustness by using the content of an image to decide how to distribute the watermark bits over the entire image. In this regard, we divided the entire host image into 16 blocks and, for each block, introduced a measure of complexity for the sub-image of that block using its quad-tree representation. Our experiments showed that the sum over levels of the number of quad-tree nodes in each level multiplied by 2 to the power of that level number was the best measure of complexity. We examined the distribution strategy by dividing the complexity measure of the blocks into 1, 4, 8, 16, ... classes. Then the performance of the watermarking algorithm was evaluated with respect to its stability against usual attacks such as noise addition, smoothing, compression, clipping, rotation, shuffling and the chessboard effect. The best result was obtained with 4 classes of complexity. Moreover, as a result of this work, we conclude that there is a limit on the stability when we are forced to embed a particular watermark pattern into a given host image. We suggest that it is possible to obtain optimum robustness for a given host image if we can choose, from a watermark-pattern data bank, a watermark pattern with a different number of bits that yields the optimum robustness.
References
1. M. Kutter, F. Jordan, and F. Bossen, Digital watermarking of color images using amplitude modulation, Journal of Electronic Imaging, vol. 7, no. 2, pp. 326-332, 1998.
2. B. Chen, Design and analysis of digital watermarking, information embedding and data hiding systems, PhD thesis, MIT, June 2000.
3. F. Yaghmaee, M. Jamzad, A robust watermarking method for color images using Naive-Bayes classifier, IASTED Conf. on Signal and Image Processing (SIP 2003), August 2003.
4. J. Cao, J. Fowler, and N. Younan, An image-adaptive watermark based on a redundant wavelet transform, IEEE Conf. on Image Processing, pp. 277-280, Oct. 2001.
5. C. I. Podilchuk, E. J. Delp, Digital watermarking: algorithms and applications, IEEE Signal Processing Magazine, July 2003.
6. R. Franco, D. Malah, Adaptive image partitioning for fractal coding achieving designated rates under a complexity constraint, IEEE International Conference on Image Processing, 2001.
7. Chandramouli, N. Memon, How many pixels to watermark?, International Conference on Information Technology: Coding and Computing (ITCC'00), Nevada, 2000.
Comparison of Intelligent Classification Techniques Applied to Marble Classification João M.C. Sousa and João R. Caldas Pinto Technical University of Lisbon, Instituto Superior Técnico Dept. of Mechanical Engineering, GCAR/IDMEC 1049-001 Lisbon, Portugal {jcpinto,j.sousa}@dem.ist.utl.pt
Abstract. Automatic marble classification based on visual appearance is an important industrial issue. However, there is no definitive solution to the problem, mainly due to the presence of a high number of randomly distributed colors and to the subjective evaluation made by human experts. In this paper, we present a study of soft-computing classification algorithms, which proved to be a valuable tool for this type of problem. Fuzzy, neural, simulated annealing, genetic and combinations of these approaches are compared, for both color and vein classification of marbles. The combination of fuzzy classifiers optimized by genetic algorithms proved to be the best classifier for this application.
1 Introduction The automatic classification of objects based on their visual appearance is an important industrial issue in different areas such as wool manufacturing and ornamental stones such as marbles. However, natural surfaces pose challenging problems due to their great richness in colors and patterns. This paper deals with marble stones. In general, ornamental stones are quantitatively characterized by properties such as geological-petrographical and mineralogical composition, or mechanical strength. The properties of such products differ not only in terms of type, but also in terms of origin, and their variability can also be significant within the same deposit or quarry. Further, the appearance of marbles is conditioned by the type of stone, and by the subjective evaluation of "beauty". Traditionally, the selection process is based on human visual inspection, giving a subjective characterization of the materials' appearance instead of an objective, reliable measurement of visual properties such as color, texture, shape and the dimensions of their components. However, quality control is essential to keep marble industries competitive: shipments of finished products (e.g. slabs) must be of uniform quality, and the price demanded for a product, particularly a new one, must be justified. Thus, it is very important to have a tool for the objective characterization of appearance. In this paper, we are concerned with marble classification. Several intelligent classification techniques are presented and discussed for this purpose. In order to support these algorithms and to better emulate the results obtained by human experts, several distances and measures are used [6]. The classification techniques obtain the classes based on a set of features. These features are derived by a quadtree-based segmentation analysis of marble images, as previously presented and discussed in [5].
This paper compares four different classification techniques, namely simulated annealing [7], fuzzy classification based on fuzzy clustering [1,9], neural networks [4] and a fuzzy classification algorithm optimized using a real-coded genetic algorithm [8]. The paper is organized as follows. Section 2 presents the marble segmentation technique used in the paper, namely the quadtree segmentation analysis of images. Section 3 presents a short description of the intelligent classification algorithms. Takagi-Sugeno fuzzy models optimized by genetic algorithms were introduced recently, and are described in more detail. The results for color and vein classification are presented in Section 4. Finally, some conclusions are presented in Section 5.
2 Data Segmentation Techniques Color-based surface segmentation is still an active topic of research due to its inherent difficulty and its unquestionable practical interest in problems like automatic surface quality control, so the envisaged solutions are strongly correlated with the application. One of the main goals is to be able to emulate by computer the way human experts classify marbles, based on their visual appearance. Key questions to be answered concern color homogeneity and color similarity. In marble classification for commercial use the key question is how similar two marbles are. In general, marbles are classified in terms of color and veins (without veins or with some veins), see Fig. 1. A possible
Fig. 1. Relevant characteristics for marble classification.
solution to this problem is to use a measure of similarity based on the RGB histograms. Based on this measure and on a quadtree division of the image, a segmented image can be obtained [5]. Each homogeneous region is characterized by the mean of its colour distribution, and this mean constitutes its feature vector. The image that results from the above processing is then composed of a set of regions. This constitutes a color palette. However, one is usually interested in retaining only those colors that can be visually distinguished. According to this way of perception, in which one is subjectively attracted by the bigger areas, the following algorithm is suggested. Repeat for all the segmented regions: 1. Order the different encountered regions according to their area. 2. The first region in the rank is then compared with the remaining ones.
3. The regions that match a given distance criterion will have their color feature replaced by the color feature of the first region, and are taken out of the list.
To measure the visual distinction between colors, the colors are first converted to the HSV space and then the Manhattan distance is used to compute the respective distances between regions. Finally, this whole process of image decomposition and reconstruction generates a set of useful information that permits defining the following set of image features to be used in subsequent marble clustering and classification: the total area of a region; the number of connected regions (fragments) that originate that area; the RGB values attributed to the area (RGB of the largest connected region); and the average and variance of the HSV values of the connected regions.
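A rough sketch of this area-ordered merging step is shown below; the region representation and the distance threshold are illustrative assumptions, not the authors' implementation.

```python
import colorsys

def merge_similar_regions(regions, dist_thresh):
    """regions: list of dicts with 'area' and 'rgb' (RGB values in [0, 1]).
    Larger regions absorb the colour feature of smaller ones that look similar in HSV."""
    regions = sorted(regions, key=lambda r: r['area'], reverse=True)
    kept = []
    while regions:
        ref = regions.pop(0)                      # largest remaining region
        kept.append(ref)
        ref_hsv = colorsys.rgb_to_hsv(*ref['rgb'])
        remaining = []
        for r in regions:
            hsv = colorsys.rgb_to_hsv(*r['rgb'])
            manhattan = sum(abs(a - b) for a, b in zip(ref_hsv, hsv))
            if manhattan < dist_thresh:
                r['rgb'] = ref['rgb']             # replace colour feature, drop from the list
                kept.append(r)
            else:
                remaining.append(r)
        regions = remaining
    return kept
```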
3 Classification Algorithms This paper uses both supervised and unsupervised clustering techniques to derive the models that classify the marbles. The unsupervised methods do not use the training-set classification to derive the clusters. This type of algorithm has a cost function that must be optimized. The classification of a given data point (marble) is obtained from the distance of its features to the respective clusters. This paper uses two types of unsupervised clustering algorithms: simulated annealing and fuzzy c-means. In the supervised methods, the classification given in the training set is used to derive the clusters. Thus, the classification is achieved through simulation of the derived model, where the inputs are the marbles to be classified. The supervised methods applied in this paper are neural networks, and fuzzy models optimized by a genetic algorithm. Let X = {x_1, ..., x_N} be a set of N data objects, where each object is an instance represented by a vector x_k described by a set of features. The set of data objects can then be represented as a data matrix X. The clustering algorithms used in this paper determine a partition of X into C clusters.
3.1 Simulated Annealing This optimization algorithm simulates the annealing process, reproducing the crystallization process of particles. These particles move during the solidification process, which occurs as the temperature decreases; in the final stage, the system reaches the minimal-energy configuration [7]. The process starts at an initial temperature that should allow a proportion of changes that increase the system energy, in order to let the system escape from local minima. This algorithm defines the proportion of acceptable changes that increase the system's energy. The cost function is the Euclidean or Manhattan distance between the data points and the cluster centers, and it gives an indication of the convergence of the system to a lower energy state. The number of iterations performed at each temperature is proportional to the number of elements, i.e. to the dimension N of the data set. The temperature T must decrease between the successive energy levels. For consecutive levels the temperature decrement is given by
T_{i+1} = alpha * T_i, with 0 < alpha < 1. (1)
The stop criterion can be a predetermined number of temperature decreases, or the observation that the final cost does not change after several consecutive temperature decreases.
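The following sketch illustrates one possible realisation of this clustering-by-annealing scheme; the random-reassignment move and the Metropolis acceptance rule are standard simulated-annealing choices that the paper does not spell out, so they are assumptions here.

```python
import numpy as np

def sa_cluster(X, C, T0=1.0, alpha=0.9, iters_per_T=None, n_levels=100, seed=None):
    """Simulated-annealing clustering sketch: cluster labels are perturbed at random,
    accepted with the Metropolis rule, and the temperature decays geometrically (Eq. 1)."""
    rng = np.random.default_rng(seed)
    N = len(X)
    iters_per_T = iters_per_T or 5 * N                 # proportional to the data set size
    labels = rng.integers(0, C, size=N)

    def cost(lab):
        centers = np.array([X[lab == c].mean(axis=0) if np.any(lab == c) else X.mean(axis=0)
                            for c in range(C)])
        return np.abs(X - centers[lab]).sum()          # Manhattan distance to own centre

    T, E = T0, cost(labels)
    for _ in range(n_levels):
        for _ in range(iters_per_T):
            i = rng.integers(N)
            new = labels.copy()
            new[i] = rng.integers(C)                   # move: reassign one point
            dE = cost(new) - E
            if dE < 0 or rng.random() < np.exp(-dE / T):
                labels, E = new, E + dE
        T *= alpha                                     # geometric temperature decrement
    return labels
```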
3.2 Fuzzy Clustering Considering the set of data objects X, a fuzzy clustering algorithm determines a fuzzy partition of X into C clusters by computing an N x C partition matrix U and the C-tuple of corresponding cluster centers V = (v_1, ..., v_C). Often, the cluster prototypes v_i are points in the feature space. The elements u_{ki} of U represent the membership of data object x_k in cluster i. Many clustering algorithms are available for determining U and V iteratively. The fuzzy c-means algorithm is quite well known and has been shown to give good results [1]. This algorithm does not directly determine the optimal number of clusters; this paper uses simple heuristics to determine the correct number of clusters, in order to reduce the number of colors classifying the different samples of marbles. The paper uses the standard fuzzy c-means algorithm based on the Euclidean distance. The classification of marbles into the several classes is obtained by applying fuzzy clustering in the product space of the features.
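A compact sketch of the standard fuzzy c-means iteration with Euclidean distance, as used here, is given below; the fuzziness exponent m = 2 is an assumed default.

```python
import numpy as np

def fuzzy_cmeans(X, C, m=2.0, n_iter=100, tol=1e-5, seed=None):
    """Standard fuzzy c-means (Bezdek [1]) with Euclidean distance."""
    rng = np.random.default_rng(seed)
    N = len(X)
    U = rng.random((N, C))
    U /= U.sum(axis=1, keepdims=True)                  # memberships sum to 1 per object
    for _ in range(n_iter):
        Um = U ** m
        V = (Um.T @ X) / Um.sum(axis=0)[:, None]       # cluster centres
        d = np.linalg.norm(X[:, None, :] - V[None, :, :], axis=2) + 1e-12
        p = 2.0 / (m - 1.0)
        U_new = 1.0 / (d ** p * (1.0 / d ** p).sum(axis=1, keepdims=True))
        if np.abs(U_new - U).max() < tol:
            U = U_new
            break
        U = U_new
    return U, V
```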
3.3 Neural Networks Neural networks have been widely used as input-output mappings in different applications, including modeling, control and classification [4]. The main characteristics of a neural network are its parallel distributed structure and its ability to learn, which produces reasonable outputs for inputs not encountered during training. Moreover, the structure can be chosen to be simple enough to compute the output(s) from the given input(s) in very low computational time. The neural network used in this paper has hidden layers with hyperbolic-tangent activation functions and a linear output layer. The network must have few neurons in the hidden layers, and is trained using the resilient backpropagation algorithm [2].
3.4 Fuzzy Modeling Fuzzy models have gained popularity in various fields such as control engineering, decision making, classification and data mining [9]. One of the important advantages of fuzzy models is that they combine numerical accuracy with transparency in the form of rules. Hence, fuzzy models take an intermediate place between numerical and symbolic models. A method that has been extensively used for obtaining fuzzy models is fuzzy clustering. Takagi-Sugeno fuzzy models optimized using a genetic algorithm (GA), as described in [8], proved to be a good approach in terms of precision and interpretability. This method is very briefly described in the following. Takagi-Sugeno (TS) fuzzy models [10] consist of fuzzy rules where each rule describes a local input-output relation, typically in an affine form. The affine form of a TS model is given by:
R_i: If x_1 is A_i1 and ... and x_n is A_in then y_i = a_i^T x + b_i, i = 1, 2, ..., K,
where R_i is the ith rule, K denotes the number of rules, x = [x_1, ..., x_n]^T is the antecedent vector, n is the number of inputs, A_i1, ..., A_in are fuzzy sets defined in the antecedent space, y_i is the output variable for rule i, a_i is a parameter vector and b_i is a scalar offset. The consequents of the affine TS model are hyperplanes in the product space of the inputs (features) and the output. The model output y can then be computed by aggregating the contributions of the individual rules:
y = (Σ_{i=1..K} β_i y_i) / (Σ_{i=1..K} β_i),
where β_i is the degree of activation of the ith rule: β_i = Π_{j=1..n} μ_{A_ij}(x_j), and μ_{A_ij}(x_j) is the membership function of the fuzzy set A_ij in the antecedent of R_i. The data set Z to be clustered is composed from X and the classification y: Z = [X y].
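For clarity, the rule aggregation can be written as a short function; taking the product of the antecedent membership degrees as the degree of activation follows the usual TS-model convention and is an assumption here.

```python
import numpy as np

def ts_output(x, mu, a, b):
    """Aggregated output of an affine TS model at one input vector x.
    mu: (K, n) membership degrees mu_Aij(x_j) already evaluated at x,
    a:  (K, n) consequent parameter vectors, b: (K,) scalar offsets."""
    beta = mu.prod(axis=1)                    # degree of activation of each rule
    y_i = a @ x + b                           # local affine consequents
    return float(beta @ y_i / beta.sum())     # weighted mean of the rule outputs
```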
Given the data Z and the number of clusters K, the Gustafson-Kessel clustering algorithm [3] is applied to compute the fuzzy partition matrix U and the cluster centers V, as introduced in Section 3.2. Unlike the popular fuzzy c-means algorithm, the Gustafson-Kessel algorithm applies an adaptive distance measure. As such, it can find hyper-ellipsoid regions in the data that can be efficiently approximated by the hyperplanes described by the consequents of the TS model. The fuzzy sets in the antecedents of the rules are obtained from the partition matrix U, whose ikth element μ_ik is the membership degree of the data object x_k in cluster i. One-dimensional fuzzy sets A_ij are obtained from the multidimensional fuzzy sets defined point-wise in the ith row of the partition matrix by projection onto the space of the input variables x_j, where proj is the point-wise projection operator. The point-wise defined fuzzy sets are approximated by suitable parametric functions in order to compute μ_{A_ij}(x_j) for any value of x_j. The consequent parameters for each rule are obtained as a weighted ordinary least-squares estimate. Let θ_i = [a_i^T b_i]^T, let X_e denote the matrix [X 1] and let W_i denote a diagonal matrix having the degree of activation β_i(x_k) as its kth diagonal element. Assuming that the columns of X_e are linearly independent and β_i(x_k) > 0 for 1 ≤ k ≤ N, the weighted least-squares solution becomes
θ_i = (X_e^T W_i X_e)^{-1} X_e^T W_i y.
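The weighted least-squares estimate of one rule's consequent parameters can be sketched as follows.

```python
import numpy as np

def wls_consequents(X, y, beta):
    """Weighted least-squares estimate of one rule's consequent [a_i; b_i].
    X: (N, n) inputs, y: (N,) targets, beta: (N,) rule activation degrees."""
    Xe = np.hstack([X, np.ones((len(X), 1))])          # extended regressor [X 1]
    W = np.diag(beta)
    theta = np.linalg.solve(Xe.T @ W @ Xe, Xe.T @ W @ y)
    return theta[:-1], theta[-1]                       # parameter vector a_i and offset b_i
```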
Rule bases constructed from clusters are often unnecessarily redundant, because the rules defined in the multidimensional premise overlap in one or more dimensions. The resulting membership functions will thus overlap as well, and several fuzzy sets will describe approximately the same concept. Therefore, this model is optimized using a real-coded genetic algorithm, as proposed in [8]. This algorithm is described in the following section.
3.5 Genetic Algorithm for Fuzzy Model Optimization
Given the data matrix Z and the structure of the fuzzy rule base, select the number of generations and the population size L.
1. Create the initial population based on the derived fuzzy model structure.
2. Repeat the genetic optimization for each generation:
a) Select the chromosomes for operation and deletion.
b) Create the next generation: operate on the chromosomes selected for operation and substitute the chromosomes selected for deletion by the resulting offspring.
c) Evaluate the next generation by computing the fitness of each individual.
3. Select the best individual (solution) from the final generation.
4 Results 4.1 Parameters of the Algorithms The total data set is composed of 112 marbles, of which 69 constitute the training set and 43 the test set. The classifiers based on simulated annealing and fuzzy c-means used 6 clusters to classify marble color and 3 clusters to classify marble veins; these numbers were found to give the best results. The simulated annealing algorithm uses an acceptable-change proportion at the initial temperature of 0.3, the constant used to compute the number of iterations to perform at each temperature is set to 5, the parameter in (1) is equal to 0.9, and finally 100 iterations (temperature decreases) are allowed. The multi-layer perceptron has three hidden layers with 9, 12 and 9 neurons, which achieved the best classification results. These parameters were obtained experimentally. The number of epochs was set to 100, in order to avoid overfitting. TS fuzzy models were derived and optimized using the procedure described in Section 3.4. As the output of these models is a real number, it must be converted to a class label; the output of the model is set to a class accordingly. Thus, the classification value corresponds to the set {1, ..., 6} when colors are classified, and to the set {1, 2, 3} for vein classification.
4.2 Comparison of Techniques The techniques tested in this paper are all optimization algorithms that can converge to local minima. Thus, there is no guarantee that the obtained models are the best possible. However, all the techniques are based on the assumption that the algorithms most often converge to the global minimum, or at least to a value very close to it. Therefore, it is necessary to test the algorithms in statistical terms. As all the classification techniques depend on initial random values, different runs can result in different classifiers when the seed of the pseudo-random numbers is different. The four algorithms used in this paper were each run 50 times. The mean classification error was computed for each model. For each intelligent technique, the mean and the standard deviation over the 50 classifiers are computed in order to compare the performance of the algorithms.
The color classification is obtained using as features the mean HSV measure of the marbles, as this measure is known to be the best feature for color. The mean errors and the standard deviation of the train and test marble sets for the color are presented in Table 1. This table shows that the errors are relatively small for all the techniques. The
best model obtained using the training data is the optimized TS fuzzy model. However, in terms of test data the neural network performs slightly better. These two techniques are also more reliable, as their standard deviations for the test set are smaller than those of simulated annealing and fuzzy clustering. Only some of the features described in Section 2, namely the area, the number of fragments, and the variance of the HSV values, are used to perform vein classification. These features were found to be the most relevant. The results obtained for vein classification are presented in Table 2. Again, the errors are relatively small for all the techniques. In terms of training data, the neural network presents the smallest error. However, this result is misleading, as the TS fuzzy models are clearly the best classifiers on the test data. Further, the TS models also present the smallest deviation. Note that the techniques that did not perform so well in color classification, simulated annealing and fuzzy clustering, now present results very similar to the neural networks in terms of test-data error. Globally, when one intends to classify both color and veins in marbles, the TS fuzzy models optimized by a GA are the best of the four techniques tested in this paper.
5 Conclusions This paper compares four different intelligent techniques for classifying marbles in terms of color and veins. The segmentation techniques that derive the best features for classification are briefly discussed. Then, the intelligent classification techniques used in this paper are presented. The technique of deriving TS fuzzy classifiers optimized using a GA is explained in more detail. The results show that this technique is globally better than the others. Future work will deal with a deeper study of the important features in order to obtain better classifiers, and with the statistical validation of the results using cross-validation. Acknowledgements. This research is partially supported by the "Programa de Financiamento Plurianual de Unidades de I&D (POCTI), do Quadro Comunitário de Apoio III", and by the FCT project POCTI/2057/1995 - CVAM, 2nd phase, Ministério do Ensino Superior, da Ciência e Tecnologia, Portugal.
References 1. J. C. Bezdek. Pattern Recognition with Fuzzy Objective Function. Plenum Press, New York, 1981. 2. C. Bishop. Neural Networks for Pattern Recognition. Oxford University Press, New York, 1995. 3. D. Gustafson and W. Kessel. Fuzzy clustering with a fuzzy covariance matrix. In Proceedings of the IEEE Conference on Decision and Control, pages 761–766, San Diego, CA, USA, 1979. 4. S. Haykin. Neural Networks: A Comprehensive Foundation. Prentice–Hall, Upper Saddle River, NJ, second edition, 1999. 5. I.-S. Hsieh and K.-C. Fan. An adaptive clustering algorithm for color quantization. Pattern Recognition Letters, 21:337–346, 2000. 6. J. R. C. Pinto, J. M. C. Sousa, and H. Alexandre. New distance measures applied to marble classification. In A. Sanfeliu and J. Ruiz-Shulcloper, editors, Lecture Notes on Computer Science 2905, CIARP’2003, pages 383–390. Springer-Verlag, Havana, Cuba, 2003. 7. P. Salamon, P. Sibani, and R. Frost. Fact, Conjectures, and Improvements for Simulated Annealing. SIAM, Philadelphia, USA, 2002. 8. M. Setnes and H. Roubos. GA-fuzzy modeling and classification: complexity and performance. IEEE Transactions on Fuzzy Systems, 8(5):516–524, October 2000. 9. J. M. C. Sousa and U. Kaymak. Fuzzy Decision Making in Modeling and Control. World Scientific Pub. Co., Singapore, 2002. 10. T. Takagi and M. Sugeno. Fuzzy identification of systems and its applications to modelling and control. IEEE Transactions on Systems, Man and Cybernetics, 15(1):116–132, Jan./Feb. 1985.
Inspecting Colour Tonality on Textured Surfaces Xianghua Xie, Majid Mirmehdi, and Barry Thomas Department of Computer Science, University of Bristol, Bristol BS8 1UB, England {xie,majid,barry}@cs.bris.ac.uk
Abstract. We present a multidimensional histogram method to inspect tonality on colour textured surfaces, e.g. ceramic tiles. Comparison in the noise-dominated chromatic channels is error prone. We perform vector-ordered colour smoothing and generate a PCA-based reconstruction of a query tile based on a reference-tile eigenspace. Histograms of local feature vectors are then compared for tonality defect detection. The proposed method is compared and evaluated on a data set with groundtruth.
1
Introduction
The assessment of product surfaces for constant tonality is an important part of industrial quality inspection, particularly in uniform colour surfaces such as ceramic tiles and fabrics. For example with tiles, any changes in the colour shade, however subtle, will still become significant once the tiles are placed on a bathroom wall. This is a key problem in the manufacturing process, and quite tiresome and difficult when inspection is carried out manually. The problem is compounded when the surface of the object is not just plain-coloured, but textured. In short, colour shade irregularities on plain or textured surfaces are regarded as defects and manufacturers have long sought to automate the identification process. Colour histograms have proved their worth as a simple, low level approach in various applications, e.g. [1,2,3]. They are invariant to translation and rotation, and insensitive to the exact spatial distribution of the colour pixels. These characteristics make them ideal for use in application to colour shade discrimination. The colours on textured (tile) surfaces are usually randomly or pseudo-randomly applied. However, the visual colour impression of the decoration should be consistent from tile to tile. In other words, the amount of ink and the types of inks used for decoration of individual tiles should be very similar in order to produce a consistent colour shade, but the spatial distribution of particular inks is not necessarily fixed from one tile to the next (see Fig. 1). Thus, colour histogram based methods are highly appropriate for colour shade inspection tasks. Numerous studies on tile defect detection are available, such as [4,5]. The only colour grading work known to us has been reported by Boukouvalas et al., for example in [6,2]. In the former work, the authors presented spatial and temporal constancy correction of the image illumination on the surfaces of uniform colour and two-colour patterned tiles. Later in [2], they proposed a colour histogram based method to automatically grade colour shade for randomly textured tiles by measuring the difference between the RGB histograms of a reference tile and
each newly produced tile. By quantising the pixel values to a small number of bins for each band and employing an ordered binary tree, the 3D histograms were efficiently stored and compared. In this paper, we present a multidimensional histogram approach to inspect colour tonality defects on randomly textured surfaces. The method combines local colour distribution with global colour distribution by computing the local common colours and local colour variations to characterise the colour shade properties as part of the histogrammed data. The tiles used were captured by a line-scan camera and manually classified into ‘standard’ and ‘off-shade’ categories by experts. A reference tile image is selected from a small set of good samples using a voting scheme. Initially, a vector directional processing method is used to compute the Local Common Vector amongst pixels in the RGB space. This is first used to eliminate local noise and smooth the image. Then, a nine element feature vector is computed for each colour pixel in the image composed of the colour pixel itself, its local common vector, and its local colour variance. To minimise the influence of noise, principal component analysis is performed in this 9D feature space. The first few eigenvectors with the largest eigenvalues are selected to form the reference eigenspace. The colour features are then projected into this eigenspace and used to form a multidimensional histogram. By projecting the colour features of an unseen tile into the same reference eigenspace, a reconstructed image is obtained and histogram distribution comparison can be performed to measure the similarity between the new and the reference tiles. We also demonstrate that the reconstructed image shows much less noise in the chromatic channels. Finally, we present our comparative results. In Section 2, our proposed method is introduced, outlining noise analysis, local common vector smoothing, feature generation, eigenspace analysis, and histogram comparison using the linear correlation coefficient. Implementation details and results are shown in Section 3. Section 4 concludes the paper.
2
Proposed Approach
The difference between normal and abnormal randomly textured colour shades can be very subtle. Fig. 1 shows a particularly difficult example where the left and centre tiles belong to the same colour shade class and considered as normal samples, while the right one is an example of “off-shade” and should be detected as a defect.
2.1
Noise Analysis
While effort can be put into achieving uniform spatial lighting and temporal consistency during image capture, some attention must still be paid to the problem of image noise introduced in the imaging system chain. In this application, the tiles were imaged by a 2048 pixel resolution ‘Trillium TR-32’ RGB colour linescan camera. The acquired image size varied from 600 × 800 pixels to 1000 × 1000 pixels corresponding to the physical size of the tiles. To examine the noise, we performed Principal Component Analysis (PCA) directly on the RGB image. The pixel colours were then projected to the three
Fig. 1. An example of ceramic tiles with different colour shades - from left: The first two images belong to the same colour shade, the last one is an example of off-shade.
orthogonal eigenvectors, and finally mapped back to the image domain to obtain one image for each eigenchannel. An example of this is shown in Fig. 2 for the leftmost tile in Fig. 1. The first eigenchannel presents the maximum variation in the RGB space, which is in most cases the intensity. The other two orthogonal eigenchannels mainly show the chromatic information. The last eigenchannel is dominated by image noise. The vertical lines are introduced mainly by spatial variation along the line-scan camera’s scan line and the horizontal lines are introduced by temporal variations, ambient light leakage, and temperature variations.
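The eigenchannel decomposition described above can be reproduced with a few lines of NumPy; this is an illustrative sketch, not the authors' code.

```python
import numpy as np

def eigenchannels(img):
    """Project an H x W x 3 RGB image onto its own principal axes and return the
    three eigenchannel images (first ~ intensity, last typically dominated by noise)."""
    h, w, _ = img.shape
    X = img.reshape(-1, 3).astype(float)
    Xc = X - X.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    order = np.argsort(eigvals)[::-1]                  # descending eigenvalue order
    proj = Xc @ eigvecs[:, order]                      # pixel colours in the eigenvector basis
    return [proj[:, k].reshape(h, w) for k in range(3)]
```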
Fig. 2. Image noise analysis showing the three eigenchannels. The noise is highly visible in the third channel. The images have been scaled for visualisation purposes.
Clearly, the noise can dominate in certain chromaticity channels, but poses a minor effect on the intensity channel which usually has the largest variation for tile images. Direct comparison in the chromatic channels is likely to be error prone. For colour histogram based methods, each bin has identical weight and the image noise will make the distribution comparison unreliable when colour shade difference is small. For most tile images, the actual colours only occupy a very limited portion of the RGB space. In other words, the variations in chromaticity are much smaller than those of brightness. However, although the variation of
the image noise is small, it can still overwhelm the chromaticity. A variety of smoothing or diffusion methods can be used to explicitly minimise the negative effect of chromatic noise. We found vector directional smoothing [7] to be an effective and robust approach for this purpose. We adopt its underlying principles to compute the Local Common Vector (LCV), which is later also used as an additional component of our colour feature set to characterise surface shade.
2.2
Vector Directional Median and LCV
Following the work in [7], a colour is represented as a vector in the 3D RGB space. The triangular plane connecting the three primaries in the RGB cube is known as the Maxwell triangle. The intersection point of a colour vector with the Maxwell triangle gives an indication of the chromaticity of the colour, i.e. its hue and saturation, in terms of the distance of the point from the vertices of the triangle. As the position of the intersection point depends only on the direction of the colour vector, and not on its magnitude, this direction represents the chromaticity. The angle between any two colour vectors, e.g. x_i and x_j, represents the chromaticity difference between them. So, the directional median of the set of vectors within a window on the image can be considered as the vector that minimises the sum of the angles with all the other vectors in the set. The median is insensitive to extremes; as the vector direction/chromaticity determines the colour perception, the noise due to the imaging system can be approximately suppressed using this median vector. Let f be the image, a map from a continuous plane to a continuous colour space; for a colour image this space is three-dimensional. A window W with a finite number n of pixels is implied in calculating the directional median. The pixels in W are denoted as f(x_1), ..., f(x_n); the element f(x_i), denoted as x_i for convenience, is a 3 x 1 vector. Thus the vectors in W define the input set. Let a_i be the sum of the angles between the vector x_i and each of the other vectors in the set. Then,
a_i = Σ_{j=1..n} A(x_i, x_j),   (1)
where A(x_i, x_j) denotes the angle between vectors x_i and x_j in a colour image. Then, the ascending order of all the a_i gives
a_(1) ≤ a_(2) ≤ ... ≤ a_(n).   (2)
The corresponding order of the vectors in the set is given by
x_(1), x_(2), ..., x_(n).   (3)
The first term in (3) minimises the sum of the angles with all the other vectors within the set and is considered as the directional median. Meanwhile, the first k terms of (3) constitute a subset of colour vectors which have generally the same direction. In other words, they are similar in chromaticity, but they can be quite different in brightness, i.e. magnitude. However, if they are also similar in brightness, we need to choose the vector closest to the others. By considering the first k
terms we define a new simple metric so that the difference between any pair of vectors in the set is measured as
d(x_i, x_j) = | ||x_i|| - ||x_j|| |,
where ||.|| denotes the magnitude of a vector. Thus, the vector that has the least sum of differences to the other vectors is considered as the LCV. However, for computational efficiency, we select the LCV from the first k terms as the one that possesses the median brightness attribute, with approximately similar accuracy. The value of k was chosen empirically. Alternatively, an adaptive method, as described in [8], can be used to select the value of k. Thus the LCV is computed in a running local window to smooth the image. The LCV will also then be used as a component of the colour feature vector applied for shade comparison.
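A sketch of the LCV computation for one window is given below; the choice of k as half the window size is an illustrative assumption (the paper selects it empirically or adaptively [8]).

```python
import numpy as np

def local_common_vector(window, k=None):
    """window: (n, 3) array of RGB vectors from a local neighbourhood.
    Rank vectors by summed angular distance (chromaticity), keep the first k,
    and return the one with median brightness as the LCV."""
    X = window.astype(float)
    norms = np.linalg.norm(X, axis=1) + 1e-12
    U = X / norms[:, None]                             # unit vectors (chromaticity directions)
    cosines = np.clip(U @ U.T, -1.0, 1.0)
    angle_sums = np.arccos(cosines).sum(axis=1)        # a_i: sum of angles to the other vectors
    order = np.argsort(angle_sums)
    k = k or max(1, len(X) // 2)                       # assumed default for k
    candidates = order[:k]                             # subset with similar chromaticity
    med_idx = candidates[np.argsort(norms[candidates])[len(candidates) // 2]]
    return X[med_idx]
```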
2.3
Distribution Comparison in Eigenspace
Comparing global colour distributions between a reference tile and an unseen tile alone is not always enough, as subtle variations may be absorbed in the colour histograms. The evaluation of the local colour distribution becomes a necessity. Setting up the Reference - A reference tile is selected using a simple voting scheme (details in Section 3). For any pixel with colour vector c, its brightness is represented by the magnitude ||c|| and its direction (chromaticity) is determined by the two angles that it makes with two of the axes in the RGB cube. Thus, we form a nine-element colour feature vector comprising the colour pixel itself, its LCV, and the variances of the local colours in brightness and chromaticity measured against the LCV. Let h and w denote the dimensions of the colour tile image, and let X be a mean-centred N x 9 matrix containing the colour features, where N = h x w. Then, PCA is performed to obtain the eigenvectors (principal axes) e_1, ..., e_9. The matrix of eigenvectors is given as E = [e_1, ..., e_9]. The columns of E are arranged in descending order of the corresponding eigenvalues λ_1 ≥ ... ≥ λ_9. Only the first q eigenvectors, with the largest eigenvalues, are needed to represent X to a sufficient degree of accuracy, determined by a simple threshold T:
(λ_1 + ... + λ_q) / (λ_1 + ... + λ_9) ≥ T.
We refer to the subset of eigenvectors thresholded with T, E_q = [e_1, ..., e_q], as the reference eigenspace, where our colour features are well represented and surfaces with the desired shade should have a similar distribution. Characteristics not included in E_q are small in variation and likely to be redundant noise. Colour feature comparison is then performed in this eigenspace for unseen tiles. The reference setup is completed by projecting the original feature matrix X into the eigenspace, resulting in X' = X E_q. Verifying New Surfaces - For a novel tile image, the same feature extraction procedure is performed to obtain the colour feature matrix Y. However, Y is then projected into the reference eigenspace, resulting in Y' = Y E_q. Note PCA is
Fig. 3. Image reconstruction - top: The original image, the reconstructed image, and their MSE difference - bottom: the three eigenchannels of the reconstructed tile. The last channel shows texture structure, instead of being dominated by noise (cf. Fig. 2).
not performed on Y. This projection provides a mapping of the new tile in the reference eigenspace where defects will be made to stand out. Finally, 9D histogram comparison is performed to measure the similarity between X' and Y'. In [2], Boukouvalas et al. found that for comparing distributions of such kinds the Normalised Cross Correlation (NCC) performs best, as it is bounded in the range [-1..1] and easily finds a partitioning which assigns only data with acceptable correlation to the same class. For pairs of quantities (u_i, v_i), i = 1, ..., n, the NCC is
r = Σ_i (u_i - ū)(v_i - v̄) / sqrt( Σ_i (u_i - ū)^2 Σ_i (v_i - v̄)^2 ),
where ū and v̄ are the respective means. The NCC represents an indication of what residuals are to be expected if the data are fitted to a straight line using least squares. When a correlation is known to be significant, the NCC lies in a pre-defined range and the partition threshold is easy to choose. Direct multidimensional histogram comparison is computationally expensive; however, for tile images the data usually occupies only a small portion of the feature space. Thus, only those bins containing data are stored, in a binary tree structure. Unlike [2], we found it unnecessary to quantise the histogram. For comparison, we can reconstruct the tile image by mapping the colour features in the eigenspace back to the RGB space. Taking the leftmost image in Fig. 1 as the reference image providing X', the reconstructed colour features are given by X' E_q^T.
Then taking the first three elements, adding back the subtracted means and mapping to the image domain gave the reconstructed tile image, shown in Figure 3 along with the Mean Square Error (MSE) between the original and the reconstructed images. Next, noise analysis of the reconstructed image was performed (as in Section 2.1 and illustrated in Fig. 2), showing that its third channel is much less noisy (bottom row of Figure 3).
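The reference-eigenspace construction and the projection of a new tile can be sketched as follows; the threshold T = 0.95 is an assumed value used only for illustration.

```python
import numpy as np

def build_reference_eigenspace(F_ref, T=0.95):
    """F_ref: (N, 9) colour-feature matrix of the reference tile.
    Returns the feature mean, the retained eigenvectors E_q and the projected reference features."""
    mu = F_ref.mean(axis=0)
    Xc = F_ref - mu
    eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    q = np.searchsorted(np.cumsum(eigvals) / eigvals.sum(), T) + 1   # smallest q reaching T
    Eq = eigvecs[:, :q]
    return mu, Eq, Xc @ Eq

def project_new_tile(F_new, mu, Eq):
    """Project a new tile's features into the reference eigenspace (no PCA on the new tile)."""
    return (F_new - mu) @ Eq
```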
3
Implementation and Results
Our test data comprises eight tile sets, totalling 345 tile images, with known groundtruth obtained from manual classification by experts. Some sets contain significant off-shade problems, while other sets have only very subtle differences. Within each set, one third of the tiles are of standard colour shade and two thirds are off-shade. We use this data to evaluate the proposed method and compare it with a 3D colour histogram-based approach. Inspection starts with the selection of a reference tile using a voting scheme. First, a small number of good tiles are each treated as a reference and compared against each other. Each time, the tile with the largest NCC value is selected as the most similar to the reference, and only its score is accumulated. Finally, the tile with the largest score becomes the reference. The NCC threshold is chosen during this process as the limit of acceptable colour shade variation.
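A minimal sketch of the NCC measure and of this voting scheme is shown below; the scoring rule is paraphrased from the description above rather than taken from the authors' implementation.

```python
import numpy as np

def ncc(h1, h2):
    """Normalised cross correlation between two (flattened) histogram vectors."""
    a, b = h1 - h1.mean(), h2 - h2.mean()
    return float((a * b).sum() / np.sqrt((a ** 2).sum() * (b ** 2).sum()))

def select_reference(histograms):
    """Voting scheme: each candidate acts as reference once; the tile most similar to the
    current reference accumulates one vote, and the tile with most votes becomes the reference."""
    scores = np.zeros(len(histograms))
    for i, ref in enumerate(histograms):
        sims = [ncc(ref, h) if j != i else -np.inf for j, h in enumerate(histograms)]
        scores[int(np.argmax(sims))] += 1
    return int(np.argmax(scores))
```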
Table 1 shows the test results. For standard 3D colour histogramming in RGB the overall average accuracy was 88.16% (41 tiles misclassified). Specificity and sensitivity results are also shown. The processing (including NCC) requires about 1 second per 1000 × 1000 pixel image. The last five columns of Table 1 present the results of the proposed method for different W, including specificity and sensitivity results for the 3 × 3 version. The LCV computation proved to be beneficial as it decreased the negative effects introduced by noise in chromaticity. By incorporating the local colour information and comparing the dominant
colour features using a high dimensional histogram, an overall 93.86% accuracy was achieved (21 tiles misclassified). Different window sizes were tested, from 3 × 3 to 11 × 11 (not all shown), with the best results at 94.74% for 7 × 7 (18 tiles misclassified) at somewhat greater computational cost. Marked improvements in the specificity and sensitivity results were also observed. For practical implementation this technique needs to run at approximately 1 second/tile. Currently, the bottleneck in our system is in the LCV computation. The proposed method requires a computational time in the order of 20 seconds per tile at present: 0.98 seconds for its 9D histogramming, 18 seconds for LCV computation and smoothing, and 0.94 seconds for NCC computation.
4
Conclusions
We presented an automatic colour shade defect detection algorithm for randomly textured surfaces. The shade problem is defined here as visual perception in colour, not in texture. We revealed the chromatic noise through eigenchannel analysis and proposed a method to overcome it using local and global colour information and PCA on a new representative colour vector. The chromatic channels of the reconstructed image were found to be much less dominated by noise. A window size as small as 3 × 3 gives an overall accuracy of 93.86%. However, the increase in accuracy comes at a computational cost which, it is hoped, will be overcome through more optimised code, and faster hardware and memory.
Acknowledgments. The authors thank Fernando Lopez for the tile database. This work is funded by EC project G1RD–CT–2002–00783–MONOTONE, and X. Xie is partly funded by the ORS, UK.
References 1. Swain, M., Ballard, D.: Indexing via color histograms. IJCV 7 (1990) 11–32 2. Boukouvalas, C., Kittler, J., Marik, R., Petrou, M.: Color grading of randomly textured ceramic tiles using color histograms. IEEE T-IE 46 (1999) 219–226 3. Pietikainen, M., Maenpaa, T., Viertola, J.: Color texture classification with color histograms and local binary patterns. In: IWTAS. (2002) 109–112 4. Kittler, J., Marik, R., Mirmehdi, M., Petrou, M., Song, J.: Detection of defects in colour texture surfaces. In: IAPR Machine Vision Applications. (1994) 558–567 5. Penaranda, J., Briones, L., Florez, J.: Color machine vision system for process control in the ceramics industry. SPIE 3101 (1997) 182–192 6. Boukouvalas, C., Kittler, J., Marik, R., Petrou, M.: Automatic color grading of ceramic tiles using machine vision. IEEE T-IE 44(1) (1997) 132–135 7. Trahanias, P., Venetsanopoulos, A.: Vector directional filters - a new class of multichannel image processing filters. IEEE T-IP 2 (1993) 528–534 8. Trahanias, P., Karakos, D., Venetsanopoulos, A.: Directional processing of color images: theory and experimental results. IEEE T-IP 5 (1996) 868–880
Automated Visual Inspection of Glass Bottles Using Adapted Median Filtering Domingo Mery1 and Olaya Medina2 1
Departamento de Ciencia de la Computación Pontificia Universidad Católica de Chile Av. Vicuña Mackenna 4860(143), Santiago de Chile Tel. (+562) 354-5820, Fax. (+562) 354-4444
[email protected] http://www.ing.puc.cl/˜dmery 2
Departamento de Ingeniería Informática Universidad de Santiago de Chile Av. Ecuador 3659, Santiago de Chile
Abstract. This work presents a digital image processing technique for the automated visual inspection of glass bottles based on a well-known method used for inspecting aluminium die castings. The idea of this method is to generate median filters adapted to the structure of the object under test. Thus, a "defect-free" reference image can be estimated from the original image of the inspection object. The reference image is compared with the original one, and defects are detected when the difference between them is considerable. The configuration of the filters is performed off-line, including a priori information from real defect-free images; the filtering itself, on the other hand, is performed on-line. Thus, fast on-line inspection is ensured. According to our experiments, the detection performance on glass bottles was 85% and the false alarm rate was 4%. Additionally, the processing time was only 0.3 s/image. Keywords: automated visual inspection, median filter, glass bottles, ROC curves.
1
Introduction
Visual inspection is defined as a quality control task that determines if a product deviates from a given set of specifications using visual data1. Inspection usually involves measurement of specific part features such as assembly integrity, surface finish and geometric dimensions. If the measurement lies within a determined tolerance, the inspection process considers the product as accepted for use. In industrial environments, inspection is performed by human inspectors and/or automated visual inspection (AVI) systems. Human inspectors are not 1
For a comprehensive overview of automated visual inspection, the reader is referred to an excellent review paper by Newman and Jain [1]. The information given in this paragraph was extracted from this paper.
always consistent and effective evaluators of products because inspection tasks are monotonous and exhausting. Typically, there is one rejected in hundreds of accepted products. It has been reported that human visual inspection is at best 80% effective. In addition, achieving human ‘100%-inspection’, where it is necessary to check every product thoroughly, typically requires high level of redundancy, thus increasing the cost and time for inspection. For instance, human visual inspection has been estimated to account for 10% or more of the total labour costs for manufactured products. For these reasons, in many applications a batch inspection is carried out. In this case, a representative set of products is selected in order to perform inferential reasoning about the total. Nevertheless, in certain applications a ‘100%-inspection’ is required. This is the case of glass bottles fabricated for the wine industry, where it is necessary to ensure the safety of consumers. For this reason, it is necessary to check every part thoroughly. Defects in glassware can arise from an incompletely reacted batch, from batch contaminants which fail to melt completely, from interactions of the melted material with glass-contact refractories and superstructure refractories, and by devitrification. If conditions are abnormal many defects can be produced and even just one defect of only 1-2 mg in every 100 g article can be enough to give 100% rejection rates. The source identification of these defects can then be a matter of urgency [2]. The inspection of glass bottles is performed by examining each bottle through backlighting. In this case, the bottles are placed between light source and a human or computer aided inspector. This technique makes the defects of the bottle visible. There are two known approaches used in the inspection of glass bottles: The automated detection of flaws is performed by a typical pattern recognition schema (segmentation, feature extraction and classification), in which images from at least four view points are taken, potential flaws are segmented and according to the extracted features the defects are detected. Examples with neural networks can be found in [3,4,5], where a high detection performance was achieved in a laboratory prototype. In the second group, the image is taken by a linear scanner that stores the corresponding middle vertical line of the bottle. By rotating the bottle around its vertical axis, an extended image is acquired in which the whole bottle is represented . The flaws are detected by comparing the grey levels of the image with a threshold. Due to the required high-speed inspection (1 bottle/s), this method is employed in the glass industry of wine bottles. However, with this methodology only the body of the bottle can be satisfactorily inspected. Due to the edges of the regular structure of the bottleneck, the inspection requires a human operator for this part of the bottle. No results of this method are reported in the literature. In this paper, we present the results obtained in the inspection of (empty) wine bottles using a new technique in the inspection of glass. Nevertheless, the presented technique is not new in the automated inspection of aluminium die castings [6]. After the observation that the X-ray images acquired in the inspection of aluminium castings are similar to those photographic images obtained in the inspection of glass (see for example Fig. 1), we decided to investigate the
Fig. 1. (a) A flaw in an aluminium wheel. (b) Flaws in a glass bottleneck.
Fig. 2. Flaw detection in a glass bottle using a reference method.
We demonstrate that this approach, based on adapted median filtering, can be used successfully in the automated quality control of wine bottles. The rest of the paper is organised as follows: Section 2 outlines the adapted median filter; Section 3 presents the results obtained with this technique; and Section 4 gives the concluding remarks.
2 Adapted Median Filtering
Adapted median filtering is known as a reference method in the automated visual inspection of aluminium die castings [6]. In reference methods it is necessary to take still images at selected, programmed inspection positions. The inspection process is illustrated in Fig. 2. The image of the object under test is compared with a defect-free image called the reference image. If a significant difference is identified, the test piece is classified as defective. In
these approaches, the reference image is estimated from the test image using a filter consisting of several masks. The key idea of reference methods is that the masks of the filter are configured off-line from a training set of real defect-free images, and the filtering itself is performed on-line. Thus, a fast on-line inspection is ensured. There are several reference methods used in the inspection of aluminium castings; however, owing to their peak detection performance, reference methods based on the Modified Median (MODAN) filter [7] have become the most widely established in industrial applications in this field [6]. With the MODAN filter it is possible to differentiate regular structures of the piece from defects. The MODAN filter is a median filter with adapted filter masks. If the background captured by the median filter is constant, structures in the foreground will be suppressed if the number of values belonging to the structure is less than one half of the inputs to the filter. This characteristic is utilised to suppress the defect structures and to preserve the design features of the test piece in the image. The goal of the adapted median filtering is to create a defect-free image from the test image. Thus, the MODAN filter is used in order to suppress only the defect structures in the test image. Locally variable masks are used during MODAN filtering by adapting the form and size of the median filter masks to the design structure of the test piece. This way, the design structure is maintained in the estimated reference image (and the defects are suppressed). Additionally, the number of elements in the operator is reduced in order to optimise the computing time by not assigning all positions in the mask. This technique is known as a sparsely populated median filter [8]. Typically, only three inputs are used in the MODAN filter. In this case, the reference image is computed as

y(i,j) = median[ x(i-a, j-b), x(i,j), x(i+a, j+b) ],   (1)
where x(i,j) and y(i,j) are the grey values at pixel (i,j) in the test and reference images, respectively. The filter direction of the mask is determined by the distances a and b. Defects are detected when

|x(i,j) - y(i,j)| > θ(i,j),   (2)

where θ(i,j) is the threshold at pixel (i,j). The parameters a, b and θ are found in an off-line configuration process. For this task, a bank of 75 different filter masks with three inputs is used [9]. In the bank, there are masks of 3×3, 5×5, ..., 11×11 pixels. Some of them are shown in Fig. 3. Additionally, N training images of different pieces without defects are taken in the same position. A mask is selected for pixel (i,j) when the objective function
Fig. 3. Some 5×5 masks used in a MODAN filter with 3 inputs.
is minimised. The objective function is computed from each image of the training set, for n = 1, ..., N, and combines terms that denote the detection error, the flagged false alarms, and the mask size (for three input values, see equation (1), these terms are defined in [9], relative to the size of the largest mask in the bank). The threshold θ(i,j) is computed from the training images as the maximum deviation between the test and estimated reference grey values over all N defect-free images. With this choice we ensure that no false alarm is flagged in any training image. However, it is convenient to enlarge the threshold by an offset to give a larger confidence region; the selection of this parameter will be studied in the next section. Thus, once the mask is selected, the error-free reference image is estimated on-line using (1), and a defect is flagged when condition (2) is satisfied.
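To make the scheme concrete, the sketch below (not part of the original paper) illustrates a minimal Python/NumPy version of reference estimation and thresholded detection, assuming a single fixed three-input mask with symmetric offsets (a, b); the function names, the wrap-around border handling of np.roll, the additive threshold offset, and the omission of the per-pixel mask selection from the bank of 75 masks are all simplifying assumptions.

```python
import numpy as np

def modan_reference(test_img, a, b):
    """Estimate a defect-free reference image with a 3-input sparse median.
    A minimal sketch of MODAN-style filtering; the real method selects the
    offsets per pixel from a trained mask bank."""
    x = test_img.astype(float)
    fwd = np.roll(np.roll(x, a, axis=0), b, axis=1)    # x(i-a, j-b), wraps at borders
    bwd = np.roll(np.roll(x, -a, axis=0), -b, axis=1)  # x(i+a, j+b)
    return np.median(np.stack([fwd, x, bwd]), axis=0)  # pixel-wise median of 3 inputs

def train_threshold(train_imgs, a, b, offset=0.0):
    """Per-pixel threshold: maximum |test - reference| over the defect-free
    training images, optionally enlarged by an offset for a wider margin."""
    diffs = [np.abs(img - modan_reference(img, a, b)) for img in train_imgs]
    return np.max(np.stack(diffs), axis=0) + offset

def detect_flaws(test_img, a, b, theta):
    """Flag pixels whose deviation from the estimated reference exceeds theta."""
    y = modan_reference(test_img, a, b)
    return np.abs(test_img.astype(float) - y) > theta
```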
3 Results
We evaluate the performance of the MODAN filter by inspecting glass bottlenecks, because this part of the bottle is the most difficult to inspect. In our experiments, 56 images (with and without flaws) of 320×200 pixels were taken from 7 (empty) wine bottles at 8 different viewpoints by rotating the bottle around its vertical axis. 20 images without flaws were selected for the training. The other 36 images were used for the inspection. In order to assess the performance of the inspection, the Receiver Operating Characteristic (ROC) curve [10] is analysed. The ROC curve is defined as a plot of the 'sensitivity' (Sn) against the '1-specificity' (1 - Sp):

Sn = TP / (TP + FN),   1 - Sp = FP / (FP + TN),

where TP is the number of true positives (flaws correctly classified); TN is the number of true negatives (regular structures correctly classified); FP is the number of false positives (false alarms, i.e., regular structures classified as defects); and FN is the number of false negatives (flaws classified as regular structures).
Fig. 4. ROC curve for 36 test images.
Fig. 5. Test images and their corresponding detections (see false alarm in right detection).
Fig. 6. Detection in images with simulated flaws: a) test image without flaws, b) test image with simulated flaws and its corresponding detection.
Ideally, Sn = 1 and 1 - Sp = 0, i.e., all flaws are detected without flagging false alarms. The ROC curve permits the assessment of the detection performance at various operating points (e.g., thresholds in the classification). The area under the ROC curve is normally used as a measure of performance because it indicates how reliably the detection can be performed. A value of 1 gives a perfect classification, whereas 0.5 corresponds to random guessing. A ROC curve was computed for the inspection of the 36 test images, varying the threshold offset; the area under the curve was computed, and the best operating point was Sn = 0.85 and 1 - Sp = 0.04, i.e., 85% of the existing flaws were detected with only 4% of false alarms (see ROC curve in Fig. 4). The detection in two of the test images is illustrated in Fig. 5. In addition, the detection performance was evaluated in real images with simulated flaws. The simulated flaws were obtained using the technique of mask superimposition [11], where certain original grey values of an image without flaws are modified by multiplying the original grey value with a factor. Fig. 6 shows the results obtained in this simulation: only one simulated flaw was not detected, and there is no false alarm. The area under the ROC curve was also computed for the simulation. Finally, we evaluated the computational time. In our experiments, we used a PC Athlon XP, 1.6 GHz with 128 MB RAM. The selection of the masks was programmed in Matlab; in this case, 7.5 hours were required to find the filters. The detection algorithm, on the other hand, was programmed in C. The median filtering was implemented considering that only three inputs are to be evaluated. The detection was achieved in only 0.3 s/image.
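For readers who wish to reproduce this kind of evaluation, the following sketch (an illustration, not the authors' code) computes Sn, 1 - Sp and the area under the ROC curve from per-detection scores and ground-truth labels; it assumes both classes are present and uses a simple threshold sweep with trapezoidal integration.

```python
import numpy as np

def roc_curve(scores, labels):
    """Sweep a decision threshold over detection scores and return the
    sensitivity Sn and 1-specificity at each operating point.
    labels: 1 for real flaws, 0 for regular structures (ground truth)."""
    thresholds = np.unique(scores)[::-1]
    sn, fpr = [], []
    for t in thresholds:
        pred = scores >= t
        tp = np.sum(pred & (labels == 1))
        fn = np.sum(~pred & (labels == 1))
        fp = np.sum(pred & (labels == 0))
        tn = np.sum(~pred & (labels == 0))
        sn.append(tp / (tp + fn))
        fpr.append(fp / (fp + tn))
    return np.array(fpr), np.array(sn)

def auc(fpr, sn):
    """Area under the ROC curve via the trapezoidal rule."""
    order = np.argsort(fpr)
    return np.trapz(sn[order], fpr[order])
```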
4 Conclusions
In this paper, a well-known technique for inspecting aluminium castings was implemented and evaluated for the automated visual inspection of glass bottles. The idea is to generate a defect-free reference image from the original image of the inspection object. The reference image is compared with the original one, and defects are detected when the difference between them is considerable. The filter is configured off-line from a training set of real defect-free images, and the filtering itself is performed on-line. Thus, a fast on-line inspection is ensured. In our experiments, the detection performance was 85% and the false-alarm rate was 4%. Additionally, the detection was achieved in only 0.3 s/image, which means that the obtained computational time satisfies industrial requirements. It is very interesting to demonstrate that a well-known technique used in the automotive industry for inspecting aluminium die castings can be used in the inspection of glass bottles. In this case, no modification of the original methodology was required.
Acknowledgments. This work was supported by FONDECYT – Chile under grant no. 1040210.
References
1. Newman, T., Jain, A.: A survey of automated visual inspection. Computer Vision and Image Understanding 61 (1995) 231–262
2. Parker, J.: Defects in glass and their origin. In: First Balkan Conference on Glass Science & Technology, Volos, Greece (2000)
3. Firmin, C., Hamad, D., Postaire, J., Zhang, R.: Gaussian neural networks for bottles inspection: a learning procedure. International Journal of Neural Systems 8 (1997) 41–46
4. Hamad, D., Betrouni, M., Biela, P., Postaire, J.: Neural networks inspection system for glass bottles production: A comparative study. International Journal of Pattern Recognition and Artificial Intelligence 12 (1998) 505–516
5. Riffard, B., David, B., Firmin, C., Orteu, J., Postaire, J.: Computer vision systems for tuning improvement in glass bottle production: on-line gob control and crack detection. In: Proceedings of the 5th International Conference on Quality Control by Artificial Vision (QCAV-2001), Le Creusot, France (2001)
6. Mery, D., Jaeger, T., Filbert, D.: A review of methods for automated recognition of casting defects. Insight 44 (2002) 428–436
7. Filbert, D., Klatte, R., Heinrich, W., Purschke, M.: Computer aided inspection of castings. In: IEEE-IAS Annual Meeting, Atlanta, USA (1987) 1087–1095
8. Castleman, K.: Digital Image Processing. Prentice-Hall, Englewood Cliffs, New Jersey (1996)
9. Heinrich, W.: Automated Inspection of Castings using X-ray Testing. PhD thesis, Institute for Measurement and Automation, Faculty of Electrical Engineering, Technical University of Berlin (1988) (in German)
10. Duda, R., Hart, P., Stork, D.: Pattern Classification. 2nd edn. John Wiley & Sons, Inc., New York (2001)
11. Mery, D.: Flaw simulation in castings inspection by radioscopy. Insight 43 (2001) 664–668
Neuro-Fuzzy Method for Automated Defect Detection in Aluminium Castings

Sergio Hernández¹, Doris Sáez², and Domingo Mery³

¹ Departamento de Ingeniería Informática, Universidad de Santiago de Chile, Av. Ecuador 3659, Santiago de Chile
² Departamento de Ingeniería Eléctrica, Universidad de Chile, Av. Tupper 2007, Santiago de Chile
³ Departamento de Ciencia de la Computación, Pontificia Universidad Católica de Chile, Av. Vicuña Mackenna 4860(143), Santiago de Chile
[email protected]
Abstract. The automated flaw detection in aluminium castings consists of two steps: a) identification of potential defects using image processing techniques, and b) classification of potential defects into 'defects' and 'regular structures' (false alarms) using pattern recognition techniques. In the second step, since several features can be extracted from the potential defects, a feature selection must be performed. In addition, since the two classes have a skewed distribution, the classifier must be carefully trained. In this paper, we deal with the classifier design, i.e., which features can be selected, and how the two classes can be efficiently separated in a skewed class distribution. We propose a self-organizing feature map (SOM) approach for stratified dimensionality reduction and simplified model building. After a feature selection and data compression stage, a neuro-fuzzy method named ANFIS is used for pattern classification. The proposed method was tested on real data acquired from 50 noisy radioscopic images, where 23000 potential defects (with only 60 real defects) were segmented and 405 features were extracted from each potential defect. Using the new method, a good classification performance was achieved using only two features, as measured by the area under the ROC curve. Keywords: automated visual inspection, neuro-fuzzy methods, aluminium castings, ROC curves.
1 Introduction
Shrinkage, as molten metal cools during the manufacture of die castings, can cause defect regions within the work piece. These are manifested, for example, by bubble-shaped voids, cracks, slag formations or inclusions (see examples in Fig. 1). Light-alloy castings produced for the automotive industry, such as wheel rims, are considered important components for overall roadworthiness. To ensure the safety of construction, it is necessary to check every part thoroughly.
Fig. 1. Radioscopic images of wheels with defects.
Radioscopy rapidly became the accepted way for controlling the quality of die cast pieces through computer-aided analysis of X-ray images [1]. The purpose of this non-destructive testing (NDT) method is to identify casting defects, which may be located within the piece and thus are undetectable to the naked eye. The automated visual inspection of castings is a quality control task to determine automatically whether a casting complies with a given set of product and product safety specifications. Two classes of regions are possible in a digital X-ray image of an aluminium casting: regions belonging to regular structures of the specimen, and those relating to defects. In the computer-aided inspection of castings, the aim is to identify these two classes automatically. Data mining and image processing methods have been developed in a wide range of techniques for data treatment, and it is possible to apply several of these techniques to the defect detection task. Many approaches for automated defect detection in X-ray images have been used; these approaches include neural networks [2,3], statistical classifiers [3], fuzzy clustering [4] and fuzzy expert systems [5]. Typically, the automatic process used in fault detection in aluminium castings, as shown in Fig. 2, follows a pattern recognition methodology that can be summarised in two general steps [3]:
a) Identification of potential defects:
   Image formation: An X-ray image of the casting being tested is taken and stored in the computer.
   Image pre-processing: The quality of the X-ray image is improved in order to enhance the details of the X-ray image.
   Image segmentation: Each potential flaw of the X-ray image is found and isolated from the rest of the scene.
b) Detection:
   Feature extraction: The potential flaws are measured and some significant characteristics are quantified.
   Classification: The extracted features of each potential flaw are analysed and assigned to one of the two following classes: 'defect' or 'regular structure'.
Fig. 2. Automatic process in fault detection in aluminium die castings [3].
In step a), the identification of real defects must be ensured. Nevertheless, using this strategy an enormous number of regular structures (false alarms) is identified. For this reason, a detection step is required. The detection attempts to separate the existing defects from the regular structures. In step b), since several features can be extracted from the potential defects, a feature selection must be performed. In addition, since the two classes show a skewed distribution (usually, there are more than 100 false alarms for each real defect), the classifier must be carefully trained. In this paper, we deal with the classifier design, i.e., which features can be selected, and how the two classes can be efficiently separated in a skewed class distribution. A self-organizing feature map (SOM) approach is used for stratified dimensionality reduction and simplified model building [6]. After a feature selection stage, a neuro-fuzzy method based on an adaptive-network-based fuzzy inference system (ANFIS) [7] is used for the classification. The advantage of neuro-fuzzy systems is the combination of both properties: nonlinear learning based on numerical data and handling of uncertainties in data. The rest of the paper is organised as follows: in Section 2 the pattern recognition using SOM and ANFIS is presented. Experiments and results on X-ray images are presented in Section 3. Finally, Section 4 gives concluding remarks.
2 Pattern Recognition Using SOM and ANFIS Algorithm
As explained in Section 1, the automated visual inspection follows a pattern recognition methodology. This Section presents the steps of the proposed method using SOM and ANFIS algorithms applied to the automated flaw detection of castings.
2.1 Identification of Potential Defects
The X-ray image, taken with an image intensifier and a CCD camera (or a flat panel detector), must be pre-processed to improve the quality of the image. In our approach, the pre-processing techniques are used to remove noise, enhance contrast, correct the shading effect and restore blur deformation [8]. The segmentation of potential flaws identifies regions in radioscopic images that may correspond to real defects. Two general characteristics of the defects are used to identify them: a) a flaw can be considered as a connected subset of the image, and b) the grey level difference between a flaw and its neighbourhood is significant. According to the mentioned characteristics, a simple automated segmentation approach based on a LoG operator was suggested in [9]. This is a very simple detector of potential flaws, with a large number of false alarms flagged erroneously. However, its advantages are as follows: a) it is a single detector (the same detector is used for each image), b) it is able to identify potential defects independently of the placement and the structure of the specimen, i.e., without a priori information about the design structure of the test piece, and c) the detection rate of real flaws is very high (more than 95%). In order to reduce the number of false alarms, the segmented regions must be measured and classified into one of the two classes: regular structure or defect. In the following sections, the detection of defects is explained in further detail.
2.2 Feature Extraction and Feature Selection
Features are used for representing the original data in a lower-dimensional space. The extracted features can be divided into two groups: geometric features (area, perimeter, invariant moments, etc.) and intensity features (mean grey value, texture features, Karhunen-Loève coefficients, Discrete Cosine Transform coefficients, etc.) [3]. In order to build a compact and accurate model, irrelevant and redundant features are removed. The Correlation-based Feature Selection (CFS) method takes into account the usefulness of individual features for class discrimination, along with the level of inter-correlation among them [10].
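As a rough illustration of the CFS criterion (not the authors' implementation, which searches over many candidate subsets, 4009 in this paper), the merit of a single feature subset can be scored as follows; the function name cfs_merit and the use of plain Pearson correlations with a binary class label are assumptions.

```python
import numpy as np

def cfs_merit(X, y, subset):
    """Correlation-based Feature Selection merit of a feature subset:
    merit = k * mean(|feature-class corr|) /
            sqrt(k + k*(k-1) * mean(|feature-feature corr|)).
    X: (n_samples, n_features); y: class labels (0/1); subset: list of feature indices."""
    k = len(subset)
    r_cf = np.mean([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in subset])
    if k == 1:
        return r_cf
    r_ff = np.mean([abs(np.corrcoef(X[:, i], X[:, j])[0, 1])
                    for a, i in enumerate(subset) for j in subset[a + 1:]])
    return (k * r_cf) / np.sqrt(k + k * (k - 1) * r_ff)
```

A search strategy (e.g., greedy forward selection) would repeatedly call this score on candidate subsets and keep the best one.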
2.3 Stratified Dimensionality Reduction Using SOM
In the proposed approach, SOM is used for stratified dimensionality reduction for model simplification. Skewed class distributions can lead to excessive complexity in the construction of decision boundaries, so it is necessary to create a reduced representation of the original data. In the stratified dimensionality reduction approach, the idea is to have an economic representation of the whole dominant class without loss of knowledge of the internal relationships among features. SOM is performed using neural networks. The approach transforms a high-dimensional input space to a low-order discrete map. This mapping has the particularity that it preserves the input data topology while performing dimensionality reduction of this space. Every processing unit of the map is associated with a reference vector whose dimension equals that of the input
vectors. Weight updating is done by means of a lateral feedback function and winner-take-all learning, and this information forms a codebook. In this work, a SOM codebook of the dominant class is used as new training data for the next stage of classification. Thus, SOM contributes to the stratified dimensionality reduction, but in addition this approach introduces other benefits such as a decrease in computational load and noise reduction [6].
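The following minimal online SOM sketch (an illustration under simplifying assumptions, not the authors' implementation) shows how a compact codebook of the dominant class could be produced; the grid size, iteration count and decay schedules are arbitrary choices.

```python
import numpy as np

def train_som(data, grid=(10, 10), n_iter=5000, lr0=0.5, sigma0=3.0, seed=0):
    """Minimal online SOM: returns a (grid_x*grid_y, dim) codebook that gives a
    compact, topology-preserving summary of the dominant class."""
    rng = np.random.default_rng(seed)
    gx, gy = grid
    dim = data.shape[1]
    weights = rng.normal(size=(gx * gy, dim))
    # Grid coordinates of each unit, used by the neighbourhood function.
    coords = np.array([(i, j) for i in range(gx) for j in range(gy)], float)
    for t in range(n_iter):
        x = data[rng.integers(len(data))]
        bmu = np.argmin(np.sum((weights - x) ** 2, axis=1))   # best matching unit
        lr = lr0 * np.exp(-t / n_iter)                        # decaying learning rate
        sigma = sigma0 * np.exp(-t / n_iter)                  # shrinking neighbourhood
        dist2 = np.sum((coords - coords[bmu]) ** 2, axis=1)
        h = np.exp(-dist2 / (2 * sigma ** 2))                 # Gaussian neighbourhood
        weights += lr * h[:, None] * (x - weights)
    return weights

# codebook = train_som(regular_structure_features)  # a few hundred vectors summarise 22876 samples
```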
2.4 Pattern Classification Using ANFIS
Pattern classification attempts to assign input data to a pre-defined class. In our approach, an ANFIS algorithm is used for supervised classification [11]. ANFIS is a hybrid network model equivalent to a Takagi-Sugeno fuzzy model, which means that the rule base can be expressed in terms of fuzzy 'if-then' rules like

If x is A and y is B then z = f(x, y),

where A and B are fuzzy sets in the antecedent, and f(x, y) is a crisp function in the consequent. In this type of controller the defuzzification stage is replaced by a weighted average of the incoming signals from each node in the output layer. The resulting adaptive network can be viewed as shown in Fig. 3, where each node in the second layer multiplies its incoming signals and outputs the product; this value represents the firing strength of a rule, which is normalised in the next layer. Each node is a processing unit which performs a function on its incoming signals to generate a single node output [11]. This node function is a parameterised function with modifiable parameters. If the parameter set of a node is non-empty, then the node is an adaptive node and is represented as a square. On the other hand, if the parameter set is empty, the node is a fixed node, which is represented as a circle in the diagram. In this paper, the ANFIS system is used for pattern classification into defects and regular structures. Fuzzy 'if-then' rules are extracted numerically from the data and define a mapping between the features extracted from the radiographic image data
Fig. 3. ANFIS architecture [7].
and decision boundaries for defect detection. These features become fuzzy sets and fuzzy numbers rather than crisp values, achieving robustness in the decision making process with an approximate reasoning based solution.
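To illustrate the structure that ANFIS adapts, the sketch below evaluates the forward pass of a small first-order Takagi-Sugeno system with two inputs and four rules; the membership-function shapes and the rule count are assumptions, and the hybrid least-squares/back-propagation training that ANFIS uses to fit the premise and consequent parameters is omitted.

```python
import numpy as np

def gauss_mf(x, c, s):
    """Gaussian membership function."""
    return np.exp(-0.5 * ((x - c) / s) ** 2)

def sugeno_forward(x1, x2, mf_params, consequents):
    """Forward pass of a first-order Takagi-Sugeno system with 2 inputs and
    2 membership functions per input (4 rules).
    mf_params: [(c, s), ...] for A1, A2, B1, B2; consequents: 4 rows of (p, q, r)."""
    a1, a2, b1, b2 = [gauss_mf(x, c, s) for x, (c, s) in
                      zip([x1, x1, x2, x2], mf_params)]
    w = np.array([a1 * b1, a1 * b2, a2 * b1, a2 * b2])   # rule firing strengths (layer 2)
    w_norm = w / w.sum()                                 # normalisation (layer 3)
    f = np.array([p * x1 + q * x2 + r for p, q, r in consequents])
    return float(np.dot(w_norm, f))                      # weighted average output

# A score above a chosen cut-off would label the potential flaw as 'defect';
# in ANFIS, mf_params and consequents are learned from the training data.
```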
2.5 Evaluation Basis
Once the classification is carried out, a performance evaluation is required. The area under the Receiver Operating Characteristic (ROC) curve is commonly used to assess classifier performance in two-class problems [12]. This metric provides a scalar value which summarises the overall misclassification and accuracy rates while discarding the effect of the unbalanced class distribution. The ROC curve is defined as a plot of the 'sensitivity' (Sn) against the '1-specificity' (1 - Sp):

Sn = TP / (TP + FN),   1 - Sp = FP / (FP + TN),
where TP is the number of true positives (flaws correctly classified); TN is the number of true negatives (regular structures correctly classified); FP is the number of false positives (false alarms, i.e., regular structures classified as defects); and FN is the number of false negatives (flaws classified as regular structures). Ideally, Sn = 1 and 1 - Sp = 0, i.e., all flaws are detected without flagging false alarms. The ROC curve permits the assessment of the detection performance at various operating points (e.g., thresholds in the classification). The area under the ROC curve is normally used as a performance measure because it indicates how reliably the detection can be performed. A value of 1 gives perfect classification, whereas 0.5 corresponds to random guessing.
3 Experiments and Results
In our experiments, 50 X-ray images of aluminium wheels were analysed. In the segmentation, 22936 potential flaws were obtained, among which there were only 60 real flaws, i.e., the skew is 381:1. Some of the real defects were existing blow holes. The other defects were produced by drilling small holes in positions of the casting which were known to be difficult to detect (see examples in [9]). For each potential defect, 405 features were extracted. A detailed description of this data set can be found in [3]. The feature selection method evaluated 4009 subsets in a total space of 405 features. The selected features are intensity features obtained from the 32 × 32 pixel window containing the potential defect and its neighbourhood: a) feature 37: first coefficient of the Discrete Fourier Transform of the best 'Crossing Line Profile' [13]; and b) feature 360: coefficient (3,3) of the Discrete Cosine Transform [3]. The selected features are used for the complete and simplified ANFIS model building. The dominant class ('regular structures') has 22876 prototypes and
the other class ('defects') has only 60 instances. The complete ANFIS model is trained using a training set with a sample (70%) of each class, and the remaining instances (30%) as a checking set for model validation; the classifier performance for this model (16055 training patterns and 6881 checking patterns) is measured by the area under the ROC curve. Another training set is made using SOM codebook vectors from the dominant class. The simplified model uses the SOM algorithm to reduce the 22876 instances from the dominant class (100% of 'regular structures' patterns). The resulting codebook vectors and the other 60 instances from the minority class (100% of 'defect' patterns) make up the training set for this model (794 training patterns). The false alarm rate obtained with this method is 0.55080% of the total hypothetical flaws (2.52 false alarms per image), and defect detection is 95% accurate. This result outperforms the false alarm rate of 1.00279% (4.60 false alarms per image) reported in the literature with the same data [3], in which a threshold classifier was used. Table 1 summarises the results for the complete and simplified ANFIS models on the radioscopic data and the results obtained in previous work.
4 Conclusions
Two-stage simplified model building outperforms the classification performance of the complete ANFIS model. Although this improvement in classification is not decisive, the simplified model improves computational workload and speed. Sensitivity analysis using the CFS method gave good results in classifier building with this data set. Although there are powerful wrapper learning schemes for attribute selection, this method provides a good trade-off between accuracy, identification of attribute interactions, and computation time when handling large data sets. The results obtained are concordant with previous work using a Fisher discriminant for attribute selection [3], i.e., intensity features have better discriminant power for flaw detection than geometric features, so further research with this data can be done, including further intensity information, such as wavelet components of the segmented images. The main contribution of this research was the use of SOM for dimensionality reduction and of the neuro-fuzzy method ANFIS for the pattern classification task. Neural networks have an inherent ability to recognise overlapping pattern classes with highly nonlinear boundaries. On the other hand, soft computing hybridizations provide another information processing capability for handling uncertainty
from the feature extraction stage. Uncertainty handling of the feature space by means of fuzzy sets can be highly useful, even when no prior knowledge of the data topology or expert opinions is available, but there is a need for a more powerful learning architecture to reduce false positives. The best performance was achieved using the simplified ANFIS model, which means that only 2.52 false alarms per image are obtained in the identification of potential flaws.
Acknowledgment. This work was supported in part by FONDECYT – Chile under grant no. 1040210 and project DI I2-03/14-2 from the Universidad de Chile.
References
1. Mery, D., Jaeger, T., Filbert, D.: A review of methods for automated recognition of casting defects. Insight 44 (2002) 428–436
2. Aoki, K., Suga, Y.: Application of artificial neural network to discrimination of defect type in automatic radiographic testing of welds. ISIJ International 39 (1999) 1081–1087
3. Mery, D., da Silva, R., Caloba, L., Rebello, J.: Pattern recognition in the automatic inspection of aluminium castings. Insight 45 (2003) 475–483
4. Liao, T., Li, D., Li, Y.: Detection of welding flaws from radiographic images with fuzzy clustering methods. Fuzzy Sets and Systems 108 (1999) 145–158
5. Liao, T.: Classification of welding flaw types with fuzzy expert systems. Fuzzy Sets and Systems 108 (1999) 145–158
6. Vesanto, J., Alhoniemi, E.: Clustering of the self-organizing map. IEEE Transactions on Neural Networks 11 (2000) 586–600
7. Jang, J.S.: ANFIS: Adaptive-network-based fuzzy inference system. IEEE Transactions on Systems, Man, and Cybernetics 23 (1993) 665–684
8. Boerner, H., Strecker, H.: Automated X-ray inspection of aluminum casting. IEEE Trans. Pattern Analysis and Machine Intelligence 10 (1988) 79–91
9. Mery, D., Filbert, D.: Automated flaw detection in aluminum castings based on the tracking of potential defects in a radioscopic image sequence. IEEE Trans. Robotics and Automation 18 (2002) 890–901
10. Hall, M.: Correlation-Based Feature Selection for Machine Learning. PhD thesis, Waikato University, Department of Computer Science, NZ (1998)
11. Jang, J.S., Sun, C.: Neuro-fuzzy modeling and control. Proceedings of the IEEE 83 (1995) 378–406
12. Duda, R., Hart, P., Stork, D.: Pattern Classification. John Wiley & Sons, Inc., New York (2001)
13. Mery, D.: Crossing line profile: a new approach to detecting defects in aluminium castings. Lecture Notes in Computer Science 2749 (2003) 725–732
Online Sauter Diameter Measurement of Air Bubbles and Oil Drops in Stirred Bioreactors by Using Hough Transform

L. Vega-Alvarado¹, M.S. Cordova², B. Taboada¹, E. Galindo², and G. Corkidi¹

¹ Centro de Ciencias Aplicadas y Desarrollo Tecnológico, UNAM, P.O. Box 70-186, México 04510, D.F.
{vegal,btaboada,corkidi}@ibt.unam.mx
² Instituto de Biotecnología, UNAM, P.O. Box 510-3, 62250, Cuernavaca, Morelos, México
{cordova,galindo}@ibt.unam.mx
Abstract. Industrial production of important fermentation products such as enzymes, antibiotics, and aroma compounds, amongst others, involves a multiphase dispersion. Therefore, it is important to determine the influence of the bioreactor operational parameters (stirring speed, impeller type, power drawn, etc.) under which the culture achieves the highest yields. The automatic on-line analysis of multiphase dispersions occurring in microbial cultures in mechanically stirred bioreactors presents a number of important problems for image acquisition and segmentation, including heterogeneous transparency of moving objects of interest and background, blurring, overlapping and artifacts. In this work, a Hough transform based method was implemented and tested. Results were compared with those obtained manually by the expert. We concluded that, using this method, the evaluation of size distributions of air bubbles and oil drops in a mechanically stirred bioreactor was performed in a more efficient and less time-consuming way than with other semiautomatic or manual methods.
1 Introduction

The fermentation industry currently produces a wide range of products. Many fermentation processes involve the mixing of up to four phases: solid (the biomass, hydrodynamically important in the case of fungal mycelial types), liquid (which provides the carbon source, usually in the form of an immiscible oil, or constitutes the extraction phase when a solvent is used) and gaseous (the air, which provides oxygen by dispersing bubbles in the liquid medium) [1]. Therefore, it is important to determine the influence of the bioreactor operational parameters (stirring speed, impeller type, power drawn, etc.) on the efficiency of the dispersions and ultimately on culture performance. Several methods have been proposed for performing measurements of bubble size in two and three phases. Some works have been conducted by analysis of images collected using photographic [2,3,4,5,6] or video cameras [7,8]. The analysis of photographs or videos is a tedious and costly activity, involving a relatively long processing time, because the images are manually processed. Some recent works report the use of image processing techniques [9,10] for bubble size measurement; however,
the results reported in these works are based on idealised experimental conditions, for example, non-overlapping bubbles, or assume the same size for all bubbles. In some other works [11,12] no details are reported about image acquisition and processing. The published papers reveal the complexity of the automatic measurement of bubble size distribution in multiphase systems. One of the main problems is the difficulty of acquiring images in motion clear enough to characterize the elements involved in the culture (air, oil and biomass, all immersed in an aqueous solution of salts). On the other hand, the diversity and complexity of air bubbles and oil drops, as well as the presence of artifacts, the heterogeneous transparency of some objects and of the background, their low contrast, blurring, overlapping and/or similarity of classes, complicate the automation of the image analysis process. The purpose of this work is to present a system that allows the on-line acquisition and analysis of images inside a mechanically stirred tank. The acquired images are pre-processed and reduced to arc segments of one-pixel width. Air bubbles and oil drops are then detected in this image by applying a Hough transform implementation, and the Sauter diameter distribution is evaluated, demonstrating the suitability of this technique for this on-line application.
2 Image Acquisition and Pre-processing

2.1 Image Acquisition

The acquisition of images in motion presents several difficulties. The most evident comes from the velocity of the particles being analyzed in contrast to the limitations of the sensors used to capture such images. To avoid the use of expensive fast-acquisition digital or analogue video cameras, a conventional video camera synchronized to the flashing of a stroboscope was used. This type of illumination, provided through an optical probe immersed in the stirred tank, allowed the frame rate required for obtaining sharp and non-overlapped interlaced images to be decreased [13]. Moreover, the high luminescence provided by the stroboscopic lighting also helped substantially to avoid this problem by diminishing field darkening (Fig. 1a,d).
2.2 Image Pre-processing

In this work, the original image was pre-processed to extract arc segments or isolated points associated with the bubbles and drops to be identified. The purpose is to provide, as input, a binary one-pixel-wide skeleton image containing the main primitives of the shapes to be searched for, so that the objects of interest can be found by applying the Hough transform. These primitives are easy to find by a simple grey level threshold, since oil drops have black arc segments at their borders and air bubbles are fully dark. The following filters and operators are sequentially applied to the original image: a median filter to remove impulse noise, a 'flatten' filter for background correction and a 'well' filter to enhance pixels that are darker than the background.
Later, a morphological 'opening' operation is applied to join neighboring arc segments. Finally, the enhanced arc segments are reduced to a one-pixel-wide skeleton using the medial axis transform (Fig. 1b and 1e). As a result, the information to be processed by the Hough transform is dramatically reduced, making the procedure computationally feasible. The pre-processing described above was performed with commercial imaging software [14].
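An open-source approximation of this pre-processing chain (the authors used the commercial package [14]; the sketch below, with its filter sizes and threshold, is only an assumed stand-in) could look as follows in Python with SciPy/scikit-image; note that a morphological closing (dilate then erode) is used here for the arc-joining step.

```python
import numpy as np
from scipy import ndimage
from skimage import morphology

def preprocess(img, dark_thresh=40):
    """Approximate pre-processing: denoise, flatten the background, enhance
    dark primitives, binarise, join arcs and skeletonise. All parameter
    values are illustrative, not the authors' settings."""
    denoised = ndimage.median_filter(img, size=3)                     # impulse noise removal
    background = ndimage.gaussian_filter(denoised.astype(float), sigma=25)
    flattened = denoised - background                                  # 'flatten' correction
    dark = morphology.black_tophat(flattened, morphology.disk(5))      # enhance dark pixels ('well')
    binary = dark > dark_thresh                                        # simple grey-level threshold
    joined = morphology.binary_closing(binary, morphology.disk(2))     # join neighbouring arcs
    skeleton = morphology.medial_axis(joined)                          # one-pixel-wide skeleton
    return skeleton
```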
3 Segmentation

One problem with many shape detection algorithms is that the shape points are not linked together in any way. An approach to link points into defined shapes is to perform a Hough Transform (HT) [15]. Other widely used methods, such as active contours ('snakes') [16], require a given starting condition, which is not easy to provide in our application due to overlaps and embedded objects. Region growing techniques for detecting connected regions have this same starting difficulty and are equally inappropriate for overlapped objects [9]. Other methods based on the identification of curvature extrema [17,18] are very sensitive to noise. In particular, Shen et al. [19] have proposed a promising method based on an area correlation coefficient to cluster the circular arcs belonging to the same circle. Nevertheless, this method was tested with only one computer-generated sample image. For recognizing bubbles and drops, the HT is particularly useful since it is a relatively simple method for detecting circular shapes. However, the HT suffers from many difficulties stemming from binning the circumferences. The accumulator's bin sizes are determined by windowing and sampling the parameter space in a heuristic way. To detect circumferences with high accuracy, a high parameter resolution is needed, which requires a large accumulator and much processing time [20]. To reduce this problem, we propose to use only the points of the detected arc segments to calculate the centers and radii of the various circumferences of bubbles and drops present in the original image. It should be pointed out that the amount of information provided to the Hough transform by these images is substantially reduced (around 97% for the examples shown), as compared with the amount of information corresponding to the whole grey level image if fully processed.
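A hedged sketch of this idea, using the circle Hough transform of scikit-image on the skeleton pixels, is shown below; the radius range, the number of peaks and the vote cut-off are assumptions, not values from the paper.

```python
import numpy as np
from skimage.transform import hough_circle, hough_circle_peaks

def detect_circles(skeleton, radii=np.arange(8, 60, 2), n_max=50, min_votes=0.45):
    """Vote only with the skeleton (arc-segment) pixels, then pick accumulator
    peaks as bubble/drop candidates. Returns centre coordinates and radii."""
    hspaces = hough_circle(skeleton, radii)                 # one accumulator per radius
    accums, cx, cy, r = hough_circle_peaks(hspaces, radii,
                                           total_num_peaks=n_max)
    keep = accums > min_votes                               # discard weakly supported circles
    return cx[keep], cy[keep], r[keep]
```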
4 Surface-Volume Mean Sauter Diameter

The Sauter diameter is extensively used in the characterization of liquid/liquid or gas/liquid dispersions. This usage arises because it links the area of the dispersed phase to its volume and hence to mass transfer and chemical reaction rates [21]. The Sauter diameter for any size distribution of discrete entities is defined in equation (1) as

d_32 = ( Σ_{i=1..k} n_i d_i^3 ) / ( Σ_{i=1..k} n_i d_i^2 ),   (1)
where k is the number of bins, n_i is the number of drops in bin i and d_i is the size of bin i. Hence, the surface-volume mean Sauter diameter can be calculated by inserting the diameters of the individually segmented bubbles and drops into this equation.
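When every segmented bubble or drop contributes its own diameter (i.e., n_i = 1 per object), equation (1) reduces to the simple ratio implemented in the sketch below; the function name and the pixel-to-micron conversion in the usage comment are illustrative assumptions.

```python
import numpy as np

def sauter_diameter(diameters):
    """Surface-volume mean (Sauter) diameter d32 = sum(d_i^3) / sum(d_i^2),
    computed directly from the individual segmented bubble/drop diameters."""
    d = np.asarray(diameters, dtype=float)
    return np.sum(d ** 3) / np.sum(d ** 2)

# Example usage (hypothetical calibration factor):
# d32 = sauter_diameter(2 * r * microns_per_pixel)   # r from the Hough detections
```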
5 Application and Evaluation in Simulated Fermentation Procedures

This methodology was assessed in two kinds of experiments on a simulated fermentation model system for the production of γ-decalactone by Trichoderma harzianum. The first experiment included three phases: aqueous salt-rich media, castor oil and 0.5 g/l of mycelia of Trichoderma harzianum. The second experiment included air as the fourth phase. For each experiment 30 images were taken, in which the oil drop size distribution was evaluated both by the expert and by the methodology proposed in this work. In the case of the second experiment, the air bubble size distribution was also evaluated. The experimental rig and the conditions for image capture and dispersion studies have been described previously [13].
Fig. 1. Original, pre-processed and resulting images for Hough transform application. a) Acquired original image (water-oil) with stroboscopic illumination. b) Pre-processed image c) Resulting segmentation. d) Acquired original image (water-oil-air) with stroboscopic illumination. e) Pre-processed image. f) Resulting segmentation.
6 Results

Preliminary experiments indicate that it is necessary to quantify at least 300 oil drops and 300 air bubbles in order to ensure a representative set [22]. As seen in Figure 2, the bubble and drop size distributions evaluated by the expert and by the system in
both experiments are similar (correlations of 0.95, 0.96 and 0.98 for Figs. 2a, b and c, respectively).
Fig. 2. Oil drop and air bubble size distributions: a) oil drop distribution of experiment 1, b) air bubble distribution of experiment 2 and c) oil drop distribution of experiment 2.
Table 1 shows that the Sauter diameter difference between the expert and the system is around 1.5% for experiment 1, and 1.3%–10% for drops and bubbles, respectively, in experiment 2. The maximum percentage of false positives was 12%.
7 Conclusion

In this work, a Hough transform-based method to quantitatively evaluate on-line the size distributions of air bubbles and oil drops in a mechanically stirred bioreactor was implemented and tested. The Hough transform was applied to the skeleton of the original image to reduce the amount of information to be processed. The results, compared with those obtained manually by the expert, showed a good correlation and were obtained in a more efficient and less time-consuming way. This allows more comprehensive amounts of information to be analyzed, which will contribute to accurately estimating the interfacial area (e.g., by calculating the mean Sauter diameter) and therefore to characterizing the transfer efficiency of nutrients to the microbial culture. Currently, these results are providing us with very valuable and accurate information to manipulate the mechanical and biochemical parameters in the fermentation process, in order to obtain the best performance in the production of aroma compounds (such as a peach-like aroma) produced by the fungus Trichoderma harzianum [23]. The developed Hough transform-based method has proved to be very useful for this on-line application.
Acknowledgment. This work was partially supported by DGAPA-UNAM grant IN117202.
References
1. Cordova A.M.S., Sanchez A., Serrano C.L., Galindo E.: Oil and Fungal Biomass Dispersion in Stirred Tank Containing a Simulated Fermentation Broth. J. Chem. Technol. Biotechnol. 76, 1101–1106 (2001).
2. Chen H.T., Middleman S.: Drop Size Distributions in Stirred Liquid-liquid Systems. AIChE J. 13(5), 989–998 (1967).
3. Varley J.: Submerged gas-liquid jets: bubble size prediction. Chemical Engineering Science 50(5), 901–905 (1995).
4. Lage P.L., Esposito R.O.: Experimental determination of bubble size distributions in bubble columns: prediction of mean bubble diameter and gas hold up. Powder Technology 101, 142–150 (1999).
5. Chen F., Gomez C.O., Finch J.A.: Technical note: bubble size measurement in flotation machines. Minerals Engineering 14(4), 427–432 (2001).
6. Lou R., Song Q., Yang X.Y., Wang Z.: A three-dimensional photographic method for measurement of phase distribution in dilute bubble flow. Experiments in Fluids 32, 116–120 (2002).
7. Malysa K., Cymbalisty L., Czarnecki J., Masliyah J.: A method of visualization and characterization of aggregate flow inside a separation vessel, Part 1: Size, shape and rise velocity of the aggregates. International Journal of Mineral Processing 55, 171–188 (1999).
8. Zhou Z., Xu Z., Masliyah J., Kasongo T., Christendat D., Hyland K., Kizor T., Cox D.: Application of on-line visualization to flotation system. Proc. of the 32nd Annual Operator's Conference of the Canadian Mineral Processors, 120–137 (2000).
9. Pan X-H., Lou R., Yang X-Y., Yang H-J.: Three-dimensional particle image tracking for dilute particle-liquid flows in a pipe. Measurement Science and Technology 13, 1206–1216 (2002).
10. Schäfer R., Merten C., Eigenberger G.: Bubble size distribution in a bubble column reactor under industrial conditions. Experimental Thermal and Fluid Science 26, 595–604 (2002).
11. So S., Morikita H., Takagi S., Matsumoto Y.: Laser Doppler velocimetry measurement of turbulent bubbly channel flow. Experiments in Fluids 33, 135–142 (2002).
12. Takamasa T., Goto T., Hibiki T., Ishii M.: Experimental study of interfacial area transport of bubbly flow in small-diameter tube. International Journal of Multiphase Flow 29, 395–409 (2003).
13. Taboada B., Larralde P., Brito T., Vega-Alvarado L., Díaz R., Galindo E., Corkidi G.: Images Acquisition of a Multiphase Dispersions in Fermentation Processes. Journal of Applied Research and Technology 1(1), 78–84 (2003).
14. Image-Pro Plus V.4.1, Reference Guide for Windows. Media Cybernetics, USA (1999).
15. Hough P.V.C.: Methods and means for recognizing complex patterns. U.S. Patent 3,069,654 (1962).
16. Kass M., Witkin A., Terzopoulos D.: Snakes: Active contour models. Proceedings of the First International Conference on Computer Vision, 259–269 (1987).
17. Lim K., Xin K., Hong G.: Detection and estimation of circular segments. Pattern Recognition Letters 16, 627–636 (1995).
18. Pla F.: Recognition of partial circular shapes from segmented contours. Computer Vision and Image Understanding 63(2), 334–343 (1996).
19. Shen L., Song X., Iguchi M., Yamamoto F.: A method for recognizing particles in overlapped particle images. Pattern Recognition Letters 21, 21–30 (2000).
20. Xu L., Oja E., Kultanen P.: A new curve detection method: Randomized Hough transform (RHT). Pattern Recognition Letters 11, 331–338 (1990).
21. Pacek C., Man C., Nienow A.: Chemical Engineering Science 53(11), 2005–2011 (1998).
22. Lucatero S., Larralde-Corona C., Corkidi G., Galindo E.: Oil and air dispersion in a simulated fermentation broth as a function of mycelial morphology. Biotechnol. Prog. 19, 285–292 (2003).
23. Serrano-Carreón L., Flores C., Galindo E.: Production by Trichoderma harzianum in Stirred Bioreactors. Biotechnol. Prog. 13, 205–208 (1997).
Defect Detection in Textile Images Using Gabor Filters

Céu L. Beirão¹ and Mário A.T. Figueiredo²

¹ Escola Superior de Tecnologia, Instituto Politécnico de Castelo Branco, 6000-767 Castelo Branco, Portugal
[email protected]
² Instituto de Telecomunicações, Instituto Superior Técnico, 1049-001 Lisboa, Portugal
[email protected]
Abstract. This paper describes various techniques to detect defects in textile images. These techniques are based on multichannel Gabor features. The building blocks of our approaches are a modified principal component analysis (PCA) technique, used to select the most relevant features, and one-class classification techniques (a global Gaussian model, a nearest neighbor method, and a local Gaussian model). Experimental results on synthetic and real fabric images testify to the good performance of the methods considered.
1 Introduction
In the textile industry, several attempts have been made towards replacing manual inspection by automatic visual inspection (AVI). Textile fabric AVI aims at low-cost, high-speed defect detection, with high accuracy and robustness [1]. In this paper, we present several multichannel Gabor filtering-based techniques for the segmentation of local texture defects. Gabor filters achieve optimal joint localization in the spatial and spatial-frequency domains [2] and, therefore, have been successfully applied in many image processing tasks [3], [4], [5], [6]. In regular textures, defects are perceived as irregularities. We calculate the Gabor features that characterize the texture and model the defects as outliers in feature space. As only information about one of the classes (the normal class, without defects) is available for training, several one-class classification techniques are considered: a global Gaussian model, the first nearest neighbor (1NN) method, and a local Gaussian model. All these techniques provide ways to detect outliers in the adopted feature space. The paper is organized as follows: in Section 2, we briefly describe how we use Gabor filters to obtain the texture features. The three one-class techniques studied are described in Section 3. Section 4 presents experimental results, and Section 5 some final conclusions.
2 Gabor Functions
In the spatial domain, each Gabor function is a complex exponential modulated by a Gaussian envelope and has, in the 2-D plane, the following form [7], [8]:

h(x, y) = exp{ -(1/2) [ x'^2/σ_x^2 + y'^2/σ_y^2 ] } exp( j 2π F x' ),   (1)

with

x' = x cos θ + y sin θ,   y' = -x sin θ + y cos θ,   (2)

where F represents the radial frequency, the space constants σ_x and σ_y define the Gaussian envelope widths along the x and y axes, respectively, and θ is the orientation of the filter. Convolution of an image with a Gabor function performs oriented bandpass filtering. A bank of S × L Gabor filters with S scales (dilations of (1)) and L orientations (values of θ) is considered in this work:

h_{s,l}(x, y), for s = 1, ..., S and l = 1, ..., L,   (3)

with

F ∈ {F_1, ..., F_S} and θ ∈ {θ_1, ..., θ_L},   (4)

where {F_1, ..., F_S} is the set of scales (radial frequencies) and {θ_1, ..., θ_L} is the set of orientations. Each of the S × L Gabor filters in the filter bank is applied to an image f and the magnitude of the filtered output is given by

m_{s,l}(x, y) = sqrt( (f * h^e_{s,l})^2(x, y) + (f * h^o_{s,l})^2(x, y) ),   (5)

where * denotes the 2-D convolution operation, while h^e_{s,l} and h^o_{s,l} represent the even and odd parts of the corresponding Gabor filter (i.e., the cosine and sine parts of the complex exponential). As a compromise between computational complexity and performance, 16 Gabor filters are considered. There is psychophysical evidence that the human visual system uses a similar number of channels [10]. We consider circularly symmetric filters [11], distributed over four scales (S = 4) and four orientations (L = 4). In practice, each of these filters is implemented by convolution with a 7 × 7 spatial mask. The frequency range of the Gabor filters in the filter bank depends on the frequency range of the defects to be detected [9]. In this paper, the frequency values used are 1/2, 1/4, 1/8, and 1/16 cycles/pixel. The set of rotation angles used is 0, 45, 90, and 135 degrees. These choices will be shown to be able to simultaneously detect both large and small defects.
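A minimal sketch of such a filter bank in Python/NumPy is given below; the Gaussian space constant sigma is not specified in the text and is therefore an assumed value, and the mask construction follows equations (1)-(5) with a circularly symmetric envelope.

```python
import numpy as np
from scipy.signal import convolve2d

def gabor_masks(freq, theta, size=7, sigma=2.0):
    """Even (cosine) and odd (sine) 7x7 Gabor masks for one scale/orientation;
    a circularly symmetric envelope (sigma_x = sigma_y) is assumed."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)            # rotated coordinate x'
    env = np.exp(-(x ** 2 + y ** 2) / (2 * sigma ** 2))   # Gaussian envelope
    return env * np.cos(2 * np.pi * freq * xr), env * np.sin(2 * np.pi * freq * xr)

def gabor_features(img, freqs=(0.5, 0.25, 0.125, 0.0625),
                   thetas=(0, np.pi / 4, np.pi / 2, 3 * np.pi / 4)):
    """Stack of 16 magnitude images (4 scales x 4 orientations), as in eq. (5)."""
    img = img.astype(float)
    feats = []
    for f in freqs:
        for t in thetas:
            he, ho = gabor_masks(f, t)
            even = convolve2d(img, he, mode='same')
            odd = convolve2d(img, ho, mode='same')
            feats.append(np.sqrt(even ** 2 + odd ** 2))
    return np.stack(feats, axis=-1)   # shape (H, W, 16): one feature vector per pixel
```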
3 Proposed Approach
As mentioned in the introduction, defect detection corresponds to finding outliers in the Gabor feature space. For that purpose, one (or more) training image of defect-free fabric is available. This image(s) is used to obtain a training set of 16-dimensional feature vectors, where each vector contains the 16 Gabor magnitudes, at some pixel of the training image, computed according to (5). The test image is convolved with the same bank of Gabor functions. The magnitude of every filtered image is again computed using (5). This yields a test set in which each vector contains the set of 16 filtered outputs for each pixel.
Fig. 1. One-class classification techniques used (a) a global Gaussian model; (b) the first nearest neighbor (1NN) method and (c) a local Gaussian model.
The goal is to decide, for each test sample (generically denoted simply as z), whether or not it is an outlier with respect to the training set. This can be seen as a one-class classification problem, to which we apply the three techniques compared in [17], summarized in Fig. 1 and described next.
3.1 The Global Gaussian Model
The mean μ and the covariance matrix C of the training set are computed by the standard expressions

μ = (1/N) Σ_i t_i,   C = (1/N) Σ_i (t_i - μ)(t_i - μ)^T.

The dimension of the feature space is reduced by using a variant of principal component analysis (PCA) [12]. From C, we obtain an orthonormal basis by finding its eigenvalues and corresponding eigenvectors, where the eigenvalues are sorted in descending order. Let D be the diagonal matrix of (sorted) eigenvalues and V the matrix whose columns are the corresponding eigenvectors. The eigenvectors corresponding to the largest
eigenvalues are the principal axes and point in the directions of the largest data variance. Usually, the first, say p, principal components (eigenvectors), which explain some large fraction of the variance (90%, in this paper), are kept. Let A be the matrix formed by the first p columns of V. For each test sample z (feature vector of the test set), a "distance" is computed as the squared norm of the projection onto the principal subspace, d(z) = ||A^T (z - μ)||^2. Notice that, unlike in standard PCA, we are not normalizing by the eigenvalues. We have found experimentally that this choice leads to much better results; one reason for this may be that, by normalizing, we are giving most of the weight to the directions of largest variance, which may not be the best for outlier detection. Without this normalization, even if z is Gaussian, d(z) is no longer chi-square distributed (as in PCA), and we cannot use chi-square tests to detect the outliers. However, if Z is a Gaussian random vector of mean μ and covariance C, although d(Z) has a complicated distribution, its mean and variance are still easy to compute (see, e.g., [15], page 64):

E[d] = tr(A^T C A),   var[d] = 2 tr( (A^T C A)^2 ),

where tr denotes the trace of a matrix. Because A contains the first p eigenvectors of C (i.e., the first p columns of V), it can be shown that

E[d] = λ_1 + ... + λ_p,   var[d] = 2 (λ_1^2 + ... + λ_p^2).

This mean and variance allow us to establish a threshold for d(z) above which z can be considered an outlier:

d(z) > E[d] + γ sqrt(var[d]),

where γ is a factor that controls the sensitivity of the detector. In the experiments reported below, we use a value of γ which was determined empirically to be suitable for a wide range of defects.
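The sketch below (an interpretation of the description above, not the authors' code) fits the global model on the training features and applies the unnormalized projection distance with a mean-plus-gamma-standard-deviations threshold; the value of the sensitivity factor gamma is a guess, since it is not recoverable from the text.

```python
import numpy as np

def fit_global_model(T, var_fraction=0.90):
    """Fit the global model on defect-free features T (N x 16): mean, principal
    subspace explaining ~90% of the variance, and the analytic mean/std of d."""
    mu = T.mean(axis=0)
    C = np.cov(T, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(C)
    order = np.argsort(eigvals)[::-1]                  # sort descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    p = np.searchsorted(np.cumsum(eigvals) / eigvals.sum(), var_fraction) + 1
    A = eigvecs[:, :p]
    mean_d = eigvals[:p].sum()                         # E[d] = sum of kept eigenvalues
    std_d = np.sqrt(2 * np.sum(eigvals[:p] ** 2))      # var[d] = 2 * sum of their squares
    return mu, A, mean_d, std_d

def is_outlier(z, mu, A, mean_d, std_d, gamma=3.0):
    """Unnormalized projection 'distance' compared with mean + gamma*std;
    gamma plays the role of the sensitivity factor (its value here is assumed)."""
    d = np.sum((A.T @ (z - mu)) ** 2)
    return d > mean_d + gamma * std_d
```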
3.2 The First Nearest Neighbor (1NN) Method

This method is based on the nearest neighbor classifier [12], [16]. For each test object z, the distance to its first nearest neighbor in the training set, denoted ||z - NN(z)||, is computed. Next, the distance from this training sample to its own nearest neighbor in the training set, i.e., ||NN(z) - NN(NN(z))||, is also obtained (see Fig. 1(b)). The outlier detection criterion consists in comparing the quotient of these two Euclidean distances,

q(z) = ||z - NN(z)|| / ||NN(z) - NN(NN(z))||,

with a threshold value that determines the sensitivity; z is declared an outlier when the quotient exceeds this threshold. An empirically determined threshold value was found suitable and used in the experiments reported below.
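A compact sketch of this criterion using a k-d tree is given below; the helper name nn_quotient, the vectorised handling of the whole test set, and the assumption that the training set contains no duplicate vectors are all illustrative choices.

```python
import numpy as np
from scipy.spatial import cKDTree

def nn_quotient(T, Z):
    """Quotient of Euclidean distances used as the 1NN outlier score:
    q(z) = ||z - NN(z)|| / ||NN(z) - NN(NN(z))||, with NN taken in the
    defect-free training set T. Z holds the test feature vectors."""
    tree = cKDTree(T)
    d_zt, idx = tree.query(Z, k=1)        # distance from each z to its nearest training sample
    d_tt, _ = tree.query(T[idx], k=2)     # that sample's neighbours; column 0 is itself (distance 0)
    return d_zt / d_tt[:, 1]

# outliers = nn_quotient(train_feats, test_feats) > threshold   # threshold sets the sensitivity
```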
3.3 Local Gaussian Model
The local Gaussian model is a compromise between the previous two: it is less global than the global Gaussian method, but less local than the nearest neighbor method. For each test object z, its k nearest neighbors in the training set are determined (an empirically chosen value of k was found suitable and used in this work). Let T_z denote the subset of the training set formed by the k nearest neighbors of z. For T_z, the local mean μ_z and the local covariance matrix C_z are computed. Then, exactly the same method used in the global Gaussian model is applied, using μ_z and C_z.
3.4 Postprocessing
For a given test image, each of the three methods described produces a binary image with ones at the positions of the detected defects. We have found it useful to apply a 7 × 7 median filter to this binary image. This postprocessing step eliminates isolated false detections.
4 Experimental Results
The performance of the implemented methods has been evaluated on more than one hundred images. The algorithms have been tested on real and synthetic images. The reason for testing these algorithms on synthetic images was to ensure that they are able to detect difficult fabric defects, e.g., those whose grey value is equal, or close, to the average grey value of the image (see Fig. 2(a)). These defects are not easy to detect, even by a human observer. Figs. 2 and 3 illustrate the results achieved with the proposed algorithms. Fig. 2(a) shows synthetic fabric images, with the detected defects shown in Figs. 2(b)-(d). Fig. 3(a) shows real fabric images, with the detection results shown in Figs. 3(b)-(d). Our results show that these methods perform well. A quantitative measure to evaluate and compare different methods is not easy to define, as it is impossible to clearly state which pixels correspond to defects; thus, our performance evaluation is based on visual assessment only. The computational time of the global Gaussian algorithm is much smaller than that of the other two algorithms. Although this method produces more false alarms (see Figs. 2(b) and 3(b)), it has an excellent performance/cost ratio. As intuitively expected, the computational times of the 1NN and local Gaussian algorithms are considerable. The computational time of the 1NN algorithm is almost twice that of the local Gaussian algorithm, but it has better performance.
5 Conclusions
We have described three methods for the detection of defects in textile images using multichannel Gabor filtering and one-class classifiers. The algorithms have been tested on both synthetic and real images with success, and the results are shown in this paper. These results indicate that these algorithms are candidates for use in real applications.
Fig. 2. Synthetic fabric sample: (a) with defect; with segmented defect using: (b) a global Gaussian model; (c) the 1NN method; (d) a local Gaussian model.
Fig. 3. Real fabric sample: (a) with defect; with segmented defect using: (b) a global Gaussian model; (c) the 1NN method; (d) a local Gaussian model.
References
1. Cohen, F., Fan, Z., Attali, S.: Automated inspection of textile fabrics using textural models. IEEE Trans. Patt. Anal. and Mach. Intell., vol. 13 (1991), pp. 803–808.
2. Daugman, J.: Uncertainty relation in space, spatial frequency, and orientation optimized by two-dimensional visual cortical filters. Jour. Opt. Soc. Amer., vol. 2 (1985), pp. 1160–1169.
3. Jain, A., Farrokhnia, F.: Unsupervised segmentation using Gabor filters. Pattern Recognition, vol. 23 (1991), pp. 1167–1186.
4. Jain, A., Bhattacharjee, S.: Text segmentation using Gabor filters for automatic document processing. Mach. Vis. Appl., vol. 5 (1992), pp. 169–184.
5. Jain, A., Ratha, N., Lakshmanan, S.: Object detection using Gabor filters. Pattern Recognition, vol. 30 (1997), pp. 295–309.
6. Daugman, J.: Complete discrete 2-D Gabor transforms by neural networks for image analysis and compression. IEEE Trans. Acoust., Speech, and Signal Proc., vol. 36 (1988), pp. 1169–1179.
7. Manjunath, B., Ma, W.: Texture features for browsing and retrieval of image data. IEEE Trans. Patt. Anal. and Mach. Intell., vol. 18 (1996), pp. 837–842.
8. Tang, H., Srinivasan, V., Ong, S.: Texture segmentation via nonlinear interactions among Gabor feature pairs. Optical Eng., vol. 34 (1995), pp. 125–134.
9. Kumar, A., Pang, G.: Defect detection in textured materials using Gabor filters. IEEE Trans. Industry Applications, vol. 38 (2002).
10. Daugman, J.: Spatial visual channels in the Fourier plane. Vision Research, vol. 24 (1984), pp. 891–910.
11. Dunn, D., Higgins, W.: Optimal Gabor filters for texture segmentation. IEEE Trans. Image Processing, vol. 4 (1995), pp. 947–963.
12. Bishop, C.: Neural Networks for Pattern Recognition. Oxford University Press, Oxford, 1995.
13. Duda, R., Hart, P.: Pattern Classification and Scene Analysis. John Wiley, New York, 1973.
14. Gonzalez, R., Woods, R.: Digital Image Processing. Addison-Wesley, 1992.
15. Scharf, L.: Statistical Signal Processing: Detection, Estimation, and Time Series Analysis. Addison-Wesley, 1991.
16. Dasarathy, B.: Nearest Neighbor (NN) Norms: NN Pattern Classification Techniques. IEEE Computer Society Press, California, 1990.
17. Ridder, D., Tax, D., Duin, R.: An experimental comparison of one-class methods. Proc. 4th Annual Conf. of the Advanced School for Computing and Imaging, Delft (1998), pp. 213–218.
Geometric Surface Inspection of Raw Milled Steel Blocks

Ingo Reindl and Paul O'Leary

Christian Doppler Laboratory for Sensor and Measurement Systems, Institute for Automation, University of Leoben, Peter-Tunner-Strasse 27, 8700 Leoben, Austria
Abstract. In the production of raw steel blocks the surface should be free of defects: flaws degrade quality and hamper treatment in subsequent production steps. The surface of the steel blocks is covered with scale, which causes strongly varying reflectance properties; consequently, traditional intensity imaging techniques yield inferior performance. Therefore, light sectioning in conjunction with fast imaging sensors is applied to gather a range image of the steel block. Once the surface height data have been acquired, they must be analyzed with respect to unwanted cavities on the surface. Three different methods for surface approximation are treated: the first algorithm is based on a line-wise examination of the acquired profiles, unwrapping the surface using spline interpolation; the other two methods operate on surface segments and are based on polynomials and on singular value decomposition.
1 Introduction
In the steel industry there is an increasing demand for automatic inspection systems to control the quality of the products. These demands are well founded, given the high cost of correcting poor quality. Several papers on surface inspection of steel products have recently been published [2,3]. Basically, two different approaches for acquiring the surface image are considered in the literature: intensity imaging with diffuse, bright-field or dark-field illumination, and range imaging methods such as light sectioning. In many inspection applications of metallic surfaces an acceptable intensity image cannot be produced, neither with bright-field or dark-field lighting nor with diffuse illumination. Surface defects with a three-dimensional character, e.g. cavities, scratches and nicks, are visualized with higher contrast by means of range imaging. One advantage of geometric inspection is that surface height information is represented explicitly, and is therefore less influenced by changes in the reflection factor across the surface. This paper deals with the inspection of rolled steel blocks which may be covered with scale.
Due to the strongly varying reflectance properties of the surface, traditional intensity imaging methods fail and give poor performance. Hence, range imaging based on fast light sectioning is used to acquire the three-dimensional geometry of the steel block, with its embedded flaws. Afterwards, the acquired surface data are analyzed with respect to cavities on the surface. Due to the vibration of the steel blocks on the conveyor, the acquired sections need to be registered in order to assemble the three-dimensional surface. Basically, three different approaches for defect detection are treated, one line-wise and two area-based.
2 Problem Statement
The study concentrates on the detection of flaws on steel blocks which are partially covered with scale. The flaws are embedded in a relatively smooth surface and have a distinct three-dimensional geometry. The cross-section of the steel block is approximately square, with a size of approximately 130 mm by 130 mm; the length is approximately 10 meters. Defects with a width and depth of 0.2 mm and a length of a few millimeters have to be identified on moving blocks.
3 Principle of Operation
The light sectioning method is a well-known measurement technique for the optical determination of object sections [7,8] and needs no further explanation here. The principle of operation of the geometric surface inspection is relatively simple, although the difficulties lie in the detail. Multiple sections are acquired at high speed, with a typical separation on the surface in the order of millimeters. The sections are assembled to form a surface; a surface model is approximated; deviations from the model are located and analyzed as potential defects. Fast light sectioning devices which deliver a few thousand sections per second are available [9]. The principle of light sectioning is shown in Figure 1. A more detailed introduction to light sectioning can be found in [7] and [8].
4 Algorithms for Surface Approximation
At this point it is important to note that highly efficient methods are required, since 1500 sections per second must be analyzed in real time. Each of the following subsections gives a description of an approximation algorithm and briefly summarizes the experimental results. Three different methods are treated: the first algorithm is based on a line-wise examination of the acquired profiles; the second and third methods refer to segments of the surface area.
Fig. 1. Principle of the light sectioning
4.1 Unwrapping of the Surface Using Spline Interpolation
The measuring head for light sectioning is focused on the edge of the steel block, since this is the portion most prone to cracks. This means that the rounded edge is in the center of the acquired profile. Unfortunately, modeling the rounded edge as an ellipse and the adjoining planes as two tangent lines may not be accurate; in particular, the transition between the ellipse and the adjacent lines may show large deviations due to poor alignment of the mill rolls. Hence, in a first step the profile is approximated with splines to form a model. Secondly, the distance of the data points orthogonal to the spline model is determined and assembled, for a given number of profiles, as a matrix; this step implies an unwrapping of the surface. The distance is computed by determining the orthogonal distance of the acquired profile to the local tangent line of the spline model. The tangent line is fixed by two neighboring points, p1 and p2, which are moved along the spline approximation of the profile.

Fig. 2. Sketch of an orthogonal distance.
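A minimal sketch of the spline approximation of a single profile, using SciPy's smoothing spline with an assumed smoothing factor (the paper does not specify one), might look as follows:

```python
import numpy as np
from scipy.interpolate import UnivariateSpline

def fit_profile_spline(u, z, smoothing=1.0):
    """Approximate one acquired height profile z(u) by a smoothing spline.
    The smoothing factor is an assumption; the paper does not specify it."""
    return UnivariateSpline(u, z, s=smoothing * len(u))   # callable spline model

# example: model = fit_profile_spline(np.arange(len(profile)), profile)
# deviations of the measured points from model(u) are then measured
# orthogonally to the spline, as described next
```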
From this tangent line the orthogonal distance to the data point q, which lies on the acquired profile, is computed; this is shown in the sketch in Figure 2. The three points p1, p2 and q form a triangle whose area can be determined as half of the magnitude of the cross product of the corresponding vectors,

A = (1/2) |(p2 - p1) x (q - p1)|.

The orthogonal distance d of q to the local tangent spanned through p1 and p2 is then

d = 2A / |p2 - p1|,

where |p2 - p1| is the length of the baseline (local tangent) of the triangle. This orthogonal distance is computed for each range value of the profile; it is shown for a segment of the surface with an embedded flaw in Figure 3.
Fig. 3. Orthogonal distance of the modeled range data: (a) Surface segment of the rounded edge of a steel block with an embedded flaw, (b) Orthogonal distance of a spline approximation model to the measured range data.
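As a hedged illustration of the triangle-area formula above (not the authors' code; the point names follow the notation introduced here), the orthogonal point-to-tangent distance for 2-D profile points can be computed as:

```python
import numpy as np

def orthogonal_distance(p1, p2, q):
    """Distance of the profile point q from the local tangent through p1 and p2,
    using the triangle-area formula: A = 0.5*|(p2-p1) x (q-p1)|, d = 2A/|p2-p1|."""
    p1, p2, q = (np.asarray(p, dtype=float) for p in (p1, p2, q))
    cross = (p2[0] - p1[0]) * (q[1] - p1[1]) - (p2[1] - p1[1]) * (q[0] - p1[0])
    area = 0.5 * abs(cross)                      # triangle area from the 2-D cross product
    return 2.0 * area / np.linalg.norm(p2 - p1)  # height of the triangle over its baseline

# e.g. orthogonal_distance((0.0, 0.0), (1.0, 0.0), (0.5, 0.2)) == 0.2
```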
4.2 Unwrapping of the Surface Using SVD
As an alternative to the line-based method discussed in the previous section, this method refers to segments of the assembled surface model. The surface can be divided into blocks, each referred to as a matrix A, of fixed dimension.
In a first step an approximation of the surface [6] using the singular value decomposition (SVD) [4,5] is determined. Basically, there exist many algorithms for approximating a surface; however, this approach is straightforward and easy to implement. The surface block is decomposed as

A = U S V^T,

where U and V are orthogonal matrices and S is a diagonal matrix composed of the singular values, ordered from the largest to the smallest. The SVD of A is closely related to the eigendecompositions of A A^T and A^T A: the singular values of A are the non-negative square roots of the eigenvalues of A^T A. In the case of simple surface shapes, which are either planar or curved in one direction, the first (largest) singular value captures the largest portion of the variance. When the surface data are perturbed by a flaw, a large amount of the surface shape is modeled by additional singular values. Hence, the idea is to reassemble a smooth surface approximation by dropping all terms of the SVD that correspond to small singular values. This means that all singular values except the first few are set to zero, yielding a truncated matrix S_k. This is equivalent to the Eckart-Young theorem for approximating a matrix by another matrix of lower rank. The smoothed surface approximation is obtained as

A_k = U S_k V^T.

The result of the smoothed surface approximation of the data in Figure 3a is shown in Figure 5.
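The rank-truncation step can be sketched in NumPy as follows; this is illustrative only, although the choice of three retained singular values matches the example shown in Fig. 5(a):

```python
import numpy as np

def svd_smooth(block, k=3):
    """Rank-k approximation of a surface block A = U S V^T; the residual
    between the block and its smooth approximation highlights flaws."""
    U, s, Vt = np.linalg.svd(block, full_matrices=False)
    s[k:] = 0.0                         # drop all but the k largest singular values
    smooth = (U * s) @ Vt               # equivalent to U @ diag(s) @ Vt
    return smooth, block - smooth       # smooth model and defect-revealing residual
```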
4.3 Unwrapping the Surface Using a Fast Method of 2D-Approximation
This method provides a fast alternative to a true 2-dimensional polynomial approximation of the surface. The idea is to fit the segments using polynomials in one direction; in a second step, the coefficients of these polynomials are themselves fitted to obtain a smoothed approximation of the surface. In the first step the surface is approximated by a set of parallel and independent polynomials in the x-direction (Fig. 4), where the j-th profile of order n is written as

z_j(x) = a_{0,j} + a_{1,j} x + ... + a_{n,j} x^n,

and the a_{i,j} are the coefficients of the j-th polynomial.
Fig. 4. Polynomials in X-direction
Discretizing the surface in the x-direction yields, for each profile, pairs of data (x_i, z_{i,j}). The coefficients of each polynomial can be found with the method of least squares: the coefficient matrix A, whose columns contain the coefficients of the individual profiles, is the solution of the least-squares problem

V_x A = Z,

where V_x is the Vandermonde matrix of the sample positions x_i and Z collects the measured profiles. In the second step the coefficients in A are approximated order by order (i.e., row by row) with polynomials in the y-direction; applying the least-squares fit to these rows
results in a second coefficient matrix. Substituting this second-step fit back into the first-step polynomials leads to a function which represents a 2-dimensional polynomial approximation of the surface. The sensitivity of the approximation in both directions can be controlled by choosing appropriate values for the maximum orders in the x-direction and in the y-direction. Analogously, it is also possible to fit the polynomials in the y-direction first, by interchanging the roles of x and y. A fast algorithm for a high-order polynomial approximation in x and y is obtained by combining the two fits, with the second-step approximation applied to the coefficients of the first. For a low-order first-step approximation and a high-order second-step approximation of the coefficients, the number of time-consuming second-step fits is limited by the number of first-step coefficients, i.e., by the maximum orders.

Fig. 5. (a) SVD approximation of the surface using the three largest singular values, (b) Fast polynomial approximation
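A sketch of the two-step separable fit, with assumed polynomial orders and NumPy's least-squares solver standing in for whatever solver was actually used, is given below:

```python
import numpy as np

def separable_poly_smooth(Z, nx=3, ny=3):
    """Two-step polynomial approximation of a surface block Z (rows indexed by y,
    columns by x).  Step 1 fits every row with a polynomial of order nx in x;
    step 2 fits each coefficient order with a polynomial of order ny in y.
    The orders nx, ny are placeholders, not the values used in the paper."""
    rows, cols = Z.shape
    Vx = np.vander(np.arange(cols), nx + 1, increasing=True)   # design matrix in x
    Vy = np.vander(np.arange(rows), ny + 1, increasing=True)   # design matrix in y

    A, *_ = np.linalg.lstsq(Vx, Z.T, rcond=None)   # (nx+1, rows): coefficients per profile
    B, *_ = np.linalg.lstsq(Vy, A.T, rcond=None)   # (ny+1, nx+1): fit the coefficients in y

    A_smooth = Vy @ B                              # (rows, nx+1) smoothed coefficient matrix
    return A_smooth @ Vx.T                         # (rows, cols) smooth surface model

# the residual Z - separable_poly_smooth(Z) marks candidate defects
```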
5 Results
A prototype surface inspection system was developed and deployed at a steel production plant. Testing of the system showed that the area-based methods are superior to the line-based methods. As is to be expected, the SVD and polynomial approximation methods each have their own weaknesses and strengths:

1. SVD: proved good for the detection of errors on simply structured surfaces, e.g. the detection of small spherical errors on otherwise good surfaces. The orientation of the defect relative to the SVD is relevant.
2. Polynomial approximation: superior for the detection of errors on surfaces which are more strongly structured. This, for example, is the case when the motion of the surface is subject to strong vibration. Furthermore, the regularisation used in the polynomial approximation leads to good detection of errors which have a large aspect ratio.
6 Conclusion
Geometric surface inspection requires numerically efficient methods for surface modelling and for the detection and classification of errors. The simultaneous application of SVD and polynomial approximation proved very successful, whereby each method has been optimized for specific types of defects.
References
1. Pernkopf, F.: Image acquisition techniques for automatic visual inspection of metallic surfaces. NDT & E International, 36(8):609-617, 2003.
2. Newman, T.S., Jain, A.K.: A survey of automated visual inspection. Computer Vision and Image Understanding, 61(2):231-262, 1995.
3. Stefani, S.A., Nagarajah, C.R., Willgross, R.: Surface inspection technique for continuously extruded cylindrical products. Measurement Science & Technology, 10:N21-N25, 1999.
4. Datta, B.N.: Numerical Linear Algebra and Applications. Brooks & Cole Publishing, 1995.
5. Nash, J.C.: Compact Numerical Methods for Computers: Linear Algebra and Function Minimisation. Adam Hilger Ltd., 1979.
6. Long, A.E., Long, C.A.: Surface approximation and interpolation via matrix SVD. The College Mathematics Journal, 32(1):20-25, 2001.
7. Kanade, T. (ed.): Three-Dimensional Machine Vision. Kluwer Academic Publishers, 1987.
8. Johannesson, M.: SIMD Architectures for Range and Radar Imaging. PhD thesis, University of Linköping, 1995.
9. Company IVP (Integrated Vision Products): IVP Ranger SAH5 product information. URL: www.ivp.se
Author Index
Abad, Francisco I-688 Abdel-Dayem, Amr R. II-191 Adán, Antonio II-33 Aguiar, Rui II-158 Ahmed, Maher I-368, I-400 Ahn, Sang Chul I-261 Al Shaher, Abdullah I-335 Al-Mazeed, Ahmad II-363 Alajlan, Naif I-139, I-745 Alba-Castro, José Luis II-323, 660 Alegre, Enrique II-589 Alemán-Flores, Miguel II-339 Alexander, Simon K. I-236 Álvarez-León, Luis II-339 Alves, E. Ivo II-489 Ampornaramveth, V. II-530 Angel, L. I-705 Antequera, T. II-150 Ascenso, João I-588 Atine, Jean-Charles I-769 Atkinson, Gary I-621 Austin, Jim II-684 Ávila, Bruno Tenório II-234, II-249 Ávila, J.A. II-150 Azhar, Hannan Bin I-556 Azimifar, Zohreh II-331 Baek, Sunkyoung I-471 Bailly, G. II-100 Bak, EunSang I-49 Bandeira, Lourenço II-226 Banerjee, A. II-421 Banerjee, N. II-421 Barata, Teresa II-489 Barreira, N. II-43 Batista, Jorge P. II-552 Batouche, Mohamed I-147 Bedini, Luigi II-241 Beirão, Céu L. II-841 Belhadj-aissa, Aichouche I-866 Berar, M. II-100 Bernardino, Alexandre I-538, II-454 Bevilacqua, Alessandro II-481 Bhuiyan, M.A. I-530 Borgosz, Jan I-721
Bouchemakh, Lynda I-866 I-285 Brahma, S. II-421 Brassart, Eric II-471 Breckon, Toby P. I-680 Bres, Stéphane I-825 Brun, Luc I-840 Bruni, V. I-179 Bueno, Gloria II-33 Bui, Tien D. I-82 Caderno, I.G. II-132 Caldas Pinto, João R. I-253, II-226, II-802 Calpe-Maravilla, J. II-429 Camahort, Emilio I-688 Campilho, Ana II-166 Campilho, Aurélio II-59, II-108, II-158, II-166, II-372 Camps-Vails, G. II-429 Carmona-Poyato, A. I-424 Caro, A. II-150 Carreira, M.J. I-212, II-132 Castelán, Mario I-613 Castrillon, M. II-725 Cazes, T.B. II-389 Chanda, Bhabatosh II-217 Chen, Jia-Xin II-581 Chen, Mei I-220 Chen, Xinjian I-360 Chen, Yan II-200 Chen, Ying I-269 Chen, Zezhi I-638 Cherifi, Hocine I-580, II-289 Chi, Yanling I-761 Cho, Miyoung I-471 Cho, Sang-Hyun II-597 Choe, J. II-446 Choe, Jihwan I-597 Choi, E. II-446 Chowdhury, S.P. II-217 Chung, Yongwha II-770 Civanlar, Reha I-285 Clérentin, Arnaud II-471 Cloppet, F. II-84
Conte, D. II-614 Cooray, Saman II-741 Cordeiro, Viviane I-187 Cordova, M.S. II-834 Corkidi, G. II-834 Correia Miguel V. II-372, II-397 Cosío, Fernando Arámbula II-76 Császár, Gergely I-811 Csordás, I-811 Cyganek, Boguslaw I-721 Czúni, László I-811 Dang, Anrong I-195, I-269 Das, A.K. II-217 Dawood, Mohammad II-544 De Backer, Steve II-497 de Mello, Carlos A.B. II-209 De Santo, M. I-564 de With, Peter H.N. II-651 Debruyn, Walter II-497 Dejnozkova, Eva I-416 Delahoche, Laurent II-471 Denis, Nicolas I-318 Deniz, O. II-725 Desvignes, M. II-100 Dikici, I-285 Dimond, Keith I-556 Dios, J.R. Martinez-de I-90, I-376 Ditrich, Frank I-629 Di Stefano, Luigi I-408, II-437, II-481 Doguscu, Sema I-432 Dokladal, Petr I-416 Domínguez, Sergio I-318, I-833 Dopico, Antonio G. II-397 Dosil, Raquel I-655 Doulaverakis, C. I-310 Draa, Amer I-147 du Buf, J.M. Hans I-664 Durán, M.L. II-150 El Hassouni, Mohammed I-580 El Rube’, Ibrahim I-368 El-Sakka, Mahmoud R. II-191,II-759 Elarbi Boudihir, M. II-563 Falcon, A. II-725 Fang, Jianzhong I-503 Fathy, Mahmood II-623 Faure, A. II-84 Fdez-Vidal, Xosé R. I-655
Feitosa, R.Q. II-389 Feng, Xiangchu I-479 Feng, Xiaoyi II-668 Fernández, Cesar I-547 Fernández, J.J. II-141 Fernández-García, N.L. I-424 Ferreiro-Armán, M. II-323 Fieguth, Paul I-9, I-114, I-163, I-236, I-572, I-745, II-314, II-331 Figueiredo, Mário A. T. II-841 II-298 Fisher, Mark I-848, I-858 Fisher, Robert B. I-680 Flusser, Jan I-122 Foggia, P. II-614 Galindo, E. II-834 Galinski, Grzegorz I-729 Gao, Song I-82 Gao, Wen II-520, II-778 Gao, Xinbo I-74, II-381 Garcia, Bernardo II-166 Garcia, Christophe II-717 García, D. I-705 García, I. II-141 García-Pérez, David I-795 García-Sevilla, Pedro I-25 Ghaffar, Rizwan II-512 Glory, E. II-84 Gómez-Chova, L. II-429 Gomez-Ulla, F. II-132 Gonçalves, Paulo J. Sequeira I-253 González, F. II-132 González, J.M. I-705 González-Jiménez, Daniel II-660 Gou, Shuiping I-41 Gregson, Peter H. I-130 Gu, Junxia II-381 Guidobaldi, C. II-614 Guimarães, Leticia I-187 Gunn, Steve II-363 Hadid, Abdenour II-668 Hafiane, Adel I-787 Haindl, Michal II-298, II-306 Hamou, Ali K. II-191 Han, Dongil I-384 Hancock, Edwin R. I-327, I-335, I-352, I-613, I-621, II-733 Hanson, Allen I-519
Author Index Hao, Pengwei I-195, I-269 Hasanuzzaman, M. I-530 Havasi, Laszlo II-347 Hernández, Sergio II-826 Heseltine, Thomas II-684 Hotta, Kazuhiro II-405 Howe, Nicholas R. I-803 Huang, Xiaoqiang I-848, I-858 Ideses, Ianir II-273 Iivarinen, Jukka I-753 Izri, Sonia II-471 Jafri, Noman II-512 Jalba, Andrei C. I-1 Jamzad, Mansour II-794 Jeong, Pangyu I-228 Jeong, T. II-446 Jernigan, Ed I-139, I-163, II-331 Ji, Hongbing I-74 Jia, Ying II-572 Jiang, Xiaoyi II-544 Jiao, Licheng I-41, I-455, I-479, I-487, II-504 Jin, Fu I-572 Jin, Guoying I-605 Jung, Kwanho I-471 Kabir, Ehsanollah I-818 Kamel, Mohamed I-244, I-368, I-400, I-745, II-25, II-51 Kang, Hang-Bong II-597 Kangarloo, Kaveh I-818 Kartikeyan, B. II-421 Kempeneers, Pieter II-497 Khan, Shoab Ahmed II-512 Khelifi, S.F. II-563 Kim, Hyoung-Gon I-261 Kim, Ig-Jae I-261 Kim, Kichul II-770 Kim, Min II-770 Kim, Pankoo I-471 Kim, Tae-Yong II-528, II-536 Kobatake, Hidefumi I-697 Kong, Hyunjang I-471 Koprnicky, Miroslav I-400 Kourgli, Assia I-866 Kucharski, Krzysztof I-511 Kumazawa, Itsuo II-9 Kutics, Andrea I-737
Kwon, Yong-Moo I-261 Kwon, Young-Bin I-392 Lam, Kin-Man I-65 Landabaso, Jose-Luis II-463 Lanza, Alessandro II-481 Laurent, Christophe II-717 Lee, Chulhee I-597, II-446 Lee, Seong-Whan II-536 Lee, Tae-Seong I-261 Lefèvre, Sébastien II-606 Leung, Maylor K.H. I-761 Le Troter, Arnaud II-265 Li, Gang I-171 Li, Jie II-381 Li, Minglu II-116 Li, Xin II-572 Li, Yang II-733 Li, Yanxia II-200 Liang, Bojian I-638 Lieutaud, Simon I-778 Limongiello, A. II-614 Lins, Rafael Dueire II-175, II-234, II-249 Lipikorn, Rajalida I-697 Liu, Kang I-487 Liu, Shaohui II-520, II-778 Liu, Yazhou II-520, II-778 Lotfizad, A. Mojtaba II-623 Lukac, Rastislav I-155, II-1, II-124, II-281 Luo, Bin I-327 Ma, Xiuli I-455 Madeira, Joaquim II-68 Madrid-Cuevas, F.J. I-424 Majumdar, A.K. I-33 Majumder, K.L. II-421 Mandal, S. II-217 Manuel, João II-92 Marengoni, Mauricio I-519 Marhic, Bruno II-471 Mariño, C. II-132 Marques, Jorge S. I-204 Martín-Guerrero, J.D. II-429 Martín-Herrero, J. II-323 Martínez-Albalá, Antonio II-33 Martínez-Usó, Adolfo I-25 Mattoccia, Stefano I-408, II-437 Mavromatis, Sebastien II-265
McDermid, John I-638 McGeorge, Peter I-295 Meas-Yedid, V. II-84 Medina, Olaya II-818 Medina-Carnicer, R. I-424 Melo, José II-454 Mendez, J. II-725 Mendonça, Ana Maria II-108, II-158 Mendonça, L.F. I-253 Mery, Domingo I-647, II-818, II-826 Meyer, Fernand I-840 Meynet, Julien II-709 Micó, Luisa I-440 Mikeš, Stanislav II-306 Mirmehdi, Majid I-212, II-810 Mochi, Matteo II-241 Mohamed, S.S. II-51 Mohan, M. I-33 Mola, Martino II-437 Moon, Daesung II-770 Moreira, Rui II-108 Moreno, J. II-429 Moreno, Plinio I-538 Mosquera, Antonio I-795 Mota, G.L.A. II-389 Naftel, Andrew II-454 Nair, P. II-421 Nakagawa, Akihiko I-737 Nedevschi, Sergiu I-228 Neves, António J.R. I-277 Nezamoddini-Kachouie, Nezamoddin I-163 Nicponski, Henry II-633 Nixon, Mark II-363 Nourine, R. II-563 Nunes, Luis M. II-397 O’Connor, Noel II-741 O’Leary, Paul II-849 Ochoa, Felipe I-647 Oh, Sang-Rok I-384 Oliver, Gabriel I-672 Olivo-Marin, J-Ch. II-84 Ollero, A. I-90, I-376 Ortega, Marcos I-795 Ortigosa, P.M. II-141 Ortiz, Alberto I-672 Ouda, Abdelkader H. II-759
Paiva, António R.C. I-302 Palacios, R. II-150 Palma, Duarte I-588 Pan, Sung Bum II-770 Pardas, Montse II-463 Pardo, Xosé M. I-655 Park, Hanhoon II-700 Park, Jaehwa I-392 Park, Jihun II-528, II-536 Park, Jong-Il II-700 Park, Sunghun II-528 Pastor, Moisés II-183 Pavan, Massimiliano I-17 Payá Luis I-547 Payan, Y. II-100 Pears, Nick E. I-638, II-684 Pelillo, Marcello I-17 Penas, M. I-212 Penedo, Manuel G. I-212, I-795, II-43,II-132 Peng, Ning-Song II-581 Percannella, G. I-564 Pereira, Fernando I-588 Petrakis, E. I-310 Pezoa, Jorge E. II-413 Pietikäinen, Matti II-668 Pimentel, Luís II-226 Pina, Pedro II-226, II-489 Pinho, Armando J. I-277, I-302 Pinho, Raquel Ramos II-92 Pinset, Ch. II-84 Pla, Filiberto I-25 Plataniotis, Konstantinos N. II-1, II-281 Podenok, Leonid P. I-447 Popovici, Vlad II-709
Qin, Li II-17 Qiu, Guoping I-65, I-503 Ramalho, Mário II-226 Ramel, J.Y. II-786 Ramella, Giuliana I-57 Rautkorpi, Rami I-753 Redondo, J.L. II-141 Reindl, Ingo II-849 Reinoso, Oscar I-547 Richardson, Iain I-295 Ricketson, Amanda I-803 Rico-Juan, Juan Ramón I-440
Author Index Riseman, Edward I-519 Rital, Soufiane II-289 Rivero-Moreno, Carlos Joel Rizkalla, K. II-51 Robles, Vanessa II-589 Rodrigues, João I-664 Rodríguez, P.G. II-150 Roerdink, Jos B.T.M. I-1 Rueda, Luis II-17 Ryoo, Seung Taek I-98
I-825
Sabri, Mahdi II-314 Sadykhov, Rauf K. I-447 Sáez, Doris II-826 Sahin, Turker I-495, II-355 Sahraie, Arash I-295 Salama, M.M.A. I-244, II-25, II-51 Salerno, Emanuele II-241 Samokhval, Vladimir A. I-447 Sanches, João M. I-204 Sánchez, F. I-705 San Pedro, José I-318 Sanniti di Baja, Gabriella I-57 Sansone, C. I-564 Santos, Beatriz Sousa II-68 Santos, Jorge A. II-397 Santos-Victor, José I-538, II-454 Saraiva, José II-489 Sarkar, A. II-421 Schaefer, Gerald I-778, II-257 Schäfers, Klaus P. II-544 Scheres, Ben II-166 Scheunders, Paul II-497 Seabra Lopes, Luís I-463 Sebastián, J.M. I-705, II-589 Sener, Sait I-344 Sequeira, Jean II-265 Serrano-López, A.J. II-429 Shan, Tan I-479, II-504 Shimizu, Akinobu I-697 Shirai, Yoshiaki I-530 Silva, Augusto II-68 Silva, José Silvestre II-68 Skarbek, I-511, I-729 Smolka, Bogdan I-155, II-1, II-124, II-281 Soares, André I-187 Song, Binheng I-171 Sousa, António V. II-158 Sousa, João M.C. I-253, II-802
Sroubek, Filip I-122 Stamon, G. II-84 Suesse, Herbert I-629 Sun, Luo II-572 Sun, Qiang I-41 Sun, Yufei II-200 Sural, Shamik I-33 Susin, Altamiro I-187 Sziranyi, Tamas II-347 Szlavik, Zoltan II-347 Taboada, B. II-834 Takahashi, Haruhisa II-405 Talbi, Hichem I-147 Tao, Linmi I-605, II-572 Tavares, R.S. II-92 Tax, David M.J. I-463 Thiran, Jean-Philippe II-709 Thomas, Barry T. I-212, II-810 Tian, Jie I-360 Tombari, Federico I-408 Tonazzini, Anna II-241 Torres, Sergio N. II-413 Toselli, Alejandro II-183 Traver, V. Javier I-538 Tsui, Hung Tat I-713 Tsunekawa, Takuya II-405 Twardowski, Tomasz I-721 Ueno, H. I-530 Unel, Mustafa I-344, I-432, I-495, II-355 Uvarov, Andrey A. I-447 Vadivel, A. I-33 Vagionitis, S. I-310 Vautrot, Philippe I-840 Vega-Alvarado, L. II-834 Venetsanopoulos, Anastasios N. Vento, M. I-564, II-614 Vicente, M. Asunción I-547 Vidal, Enrique II-183 Vidal, René I-647 Vincent, Nicole II-606, II-786 Vinhais, Carlos II-59 Visani, Muriel II-717 Vitulano, D. I-179 Vivó, Roberto I-688 Voss, Klaus I-629 Vrscay, Edward R. I-236
II-1
Wang, Lei I-74 Wang, Lijun II-520, II-778 Wang, QingHua I-463 Wang, Yuzhong I-I06 Wesolkowski, Slawo I-9 Wilkinson, Michael H.F. I-1 Wilson, Richard C. I-327 Winger, Lowell I-572 Wirotius, M. II-786 Wnukowicz, Karol I-729 Wong, Kwan-Yee Kenneth II-676 Wong, Shu-Fai II-676 Xiao, Bai I-352 Xie, Jun I-713 Xie, Xianghua II-810 Xu, Guangyou I-605, II-572 Xu, Li-Qun II-463 Xu, Qianren I-244, II-25 Yaghmaee, Farzin II-794 Yang, Jie I-106, II-581 Yang, Xin I-360, II-643, II-692 Yano, Koji II-9
Yao, Hongxun II-520, II-778 Yaroslavsky, Leonid II-273 Yazdi, Hadi Sadoghi II-623 Yi, Hongwen I-130 Yin, Jianping II-750 You, Bum-Jae I-384 Yu, Hang I-352 Zavidovique, Bertrand I-787 Zervakis, M. I-310 Zhang, Chao I-195 Zhang, Guomin II-750 Zhang, Tao I-530 Zhang, Xiangrong II-504 Zhang, Yuzhi II-200 Zhao, Yongqiang II-116 Zhong, Ying I-295 Zhou, Dake II-643, II-692 Zhou, Yue I-106 Zhu, En II-750 Zhu, Yanong I-848 Zilberstein, Shlomo I-519 Zuo, Fei II-651
Lecture Notes in Computer Science For information about Vols. 1–3136 please contact your bookseller or Springer
Vol. 3263: M. Weske, P. Liggesmeyer (Eds.), ObjectOriented and Internet-Based Technologies. XII, 239 pages. 2004.
Vol. 3223: K. Slind, A. Bunker, G. Gopalakrishnan (Eds.), Theorem Proving in Higher Order Logics. VIII, 337 pages. 2004.
Vol. 3260: I. Niemegeers, S.H. de Groot (Eds.), Personal Wireless Communications. XIV, 478 pages. 2004.
Vol. 3221: S. Albers, T. Radzik (Eds.), Algorithms – ESA 2004. XVIII, 836 pages. 2004.
Vol. 3258: M. Wallace (Ed.), Principles and Practice of Constraint Programming – CP 2004. XVII, 822 pages. 2004.
Vol. 3220: J.C. Lester, R.M. Vicari, F. Paraguaçu (Eds.), Intelligent Tutoring Systems. XXI, 920 pages. 2004.
Vol. 3256: H. Ehrig, G. Engels, F. Parisi-Presicce (Eds.), Graph Transformations. XII, 451 pages. 2004.
Vol. 3217: C. Barillot, D.R. Haynor, P. Hellier (Eds.), Medical Image Computing and Computer-Assisted Intervention – MICCAI 2004. XXXVIII, 1114 pages. 2004.
Vol. 3255: A. Benczúr, J. Demetrovics, G. Gottlob (Eds.), Advances in Databases and Information Systems. XI, 423 pages. 2004.
Vol. 3216: C. Barillot, D.R. Haynor, P. Hellier (Eds.), Medical Image Computing and Computer-Assisted Intervention – MICCAI 2004. XXXVIII, 930 pages. 2004.
Vol. 3254: E. Macii, V. Paliouras, O. Koufopavlou (Eds.), Integrated Circuit and System Design. XVI, 910 pages. 2004.
Vol. 3212: A. Campilho, M. Kamel (Eds.), Image Analysis and Recognition, Part II. XXIX, 862 pages. 2004.
Vol. 3253: Y. Lakhnech, S. Yovine (Eds.), Formal Techniques in Timed, Real-Time, and Fault-Tolerant Systems. X, 397 pages. 2004. Vol. 3250: L.-J. (LJ) Zhang, M. Jeckle (Eds.), Web Services. X, 300 pages. 2004. Vol. 3249: B. Buchberger, J.A. Campbell (Eds.), Artificial Intelligence and Symbolic Computation. X, 285 pages. 2004. (Subseries LNAI). Vol. 3246: A. Apostolico, M. Melucci (Eds.), String Processing and Information Retrieval. XIV, 316 pages. 2004. Vol. 3242: X. Yao, E. Burke, J.A. Lozano, J. Smith, J.J. Merelo-Guervós, J.A. Bullinaria, J. Rowe, A. Kabán, H.-P. Schwefel (Eds.), Parallel Problem Solving from Nature - PPSN VIII. XX, 1185 pages. 2004. Vol. 3241: D. Kranzlmüller, P. Kacsuk, J.J. Dongarra (Eds.), Recent Advances in Parallel Virtual Machine and Message Passing Interface. XIII, 452 pages. 2004. Vol. 3240: I. Jonassen, J. Kim (Eds.), Algorithms in Bioinformatics. IX, 476 pages. 2004. (Subseries LNBI). Vol. 3239: G. Nicosia, V. Cutello, P.J. Bentley, J. Timmis (Eds.), Artificial Immune Systems. XII, 444 pages. 2004. Vol. 3238: S. Biundo, T. Frühwirth, G. Palm (Eds.), KI 2004: Advances in Artificial Intelligence. XI, 467 pages. 2004. (Subseries LNAI). Vol. 3232: R. Heery, L. Lyon (Eds.), Research and Advanced Technology for Digital Libraries. XV, 528 pages. 2004. Vol. 3229: J.J. Alferes, J. Leite (Eds.), Logics in Artificial Intelligence. XIV, 744 pages. 2004. (Subseries LNAI).
Vol. 3211: A. Campilho, M. Kamel (Eds.), Image Analysis and Recognition, Part I. XXIX, 880 pages. 2004. Vol. 3210: J. Marcinkowski, A. Tarlecki (Eds.), Computer Science Logic. XI, 520 pages. 2004. Vol. 3208: H.J. Ohlbach, S. Schaffert (Eds.), Principles and Practice of Semantic Web Reasoning. VII, 165 pages. 2004. Vol. 3207: L.T. Yang, M. Guo, G.R. Gao, N.K. Jha (Eds.), Embedded and Ubiquitous Computing. XX, 1116 pages. 2004. Vol. 3206: P. Sojka, I. Kopecek, K. Pala (Eds.), Text, Speech and Dialogue. XIII, 667 pages. 2004. (Subseries LNAI). Vol. 3205: N. Davies, E. Mynatt, I. Siio (Eds.), UbiComp 2004: Ubiquitous Computing. XVI, 452 pages. 2004. Vol. 3203: J. Becker, M. Platzner, S. Vernalde (Eds.), Field Programmable Logic and Application. XXX, 1198 pages. 2004. Vol. 3202: J.-F. Boulicaut, F. Esposito, F. Giannotti, D. Pedreschi (Eds.), Knowledge Discovery in Databases: PKDD 2004. XIX, 560 pages. 2004. (Subseries LNAI). Vol. 3201: J.-F. Boulicaut, F. Esposito, F. Giannotti, D. Pedreschi (Eds.), Machine Learning: ECML 2004. XVIII, 580 pages. 2004. (Subseries LNAI). Vol. 3199: H. Schepers (Ed.), Software and Compilers for Embedded Systems. X, 259 pages. 2004. Vol. 3198: G.-J. de Vreede, L.A. Guerrero, G. Marín Raventós (Eds.), Groupware: Design, Implementation and Use. XI, 378 pages. 2004.
Vol. 3225: K. Zhang, Y. Zheng (Eds.), Information Security. XII, 442 pages. 2004.
Vol. 3194: R. Camacho, R. King, A. Srinivasan (Eds.), Inductive Logic Programming. XI, 361 pages. 2004. (Subseries LNAI).
Vol. 3224: E. Jonsson, A. Valdes, M. Almgren (Eds.), Recent Advances in Intrusion Detection. XII, 315 pages. 2004.
Vol. 3193: P. Samarati, P. Ryan, D. Gollmann, R. Molva (Eds.), Computer Security – ESORICS 2004. X, 457 pages. 2004.
Vol. 3192: C. Bussler, D. Fensel (Eds.), Artificial Intelligence: Methodology, Systems, and Applications. XIII, 522 pages. 2004. (Subseries LNAI). Vol. 3191: M. Klusch, S. Ossowski, V. Kashyap, R. Unland(Eds.), Cooperative Information Agents VIII. XI, 303 pages. 2004. (Subseries LNAI). Vol. 3190: Y. Luo (Ed.), Cooperative Design, Visualization, and Engineering. IX, 248 pages. 2004. Vol. 3189: P.-C. Yew, J. Xue (Eds.), Advances in Computer Systems Architecture. XVII, 598 pages. 2004. Vol. 3187: G. Lindemann, J. Denzinger, I.J. Timm, R. Unland (Eds.), Multiagent System Technologies. XIII, 341 pages. 2004. (Subseries LNAI). Vol. 3186: Z. Bellahsène, T. Milo, M. Rys, D. Suciu, R. Unland (Eds.), Database and XML Technologies. X, 235 pages. 2004. Vol. 3185: M. Bernardo, F. Corradini (Eds.), Formal Methods for the Design of Real-Time Systems. VII, 295 pages. 2004. Vol. 3184: S. Katsikas, J. Lopez, G. Pernul (Eds.), Trust and Privacy in Digital Business. XI, 299 pages. 2004. Vol. 3183: R. Traunmüller (Ed.), Electronic Government. XIX, 583 pages. 2004. Vol. 3182: K. Bauknecht, M. Bichler, B. Pröll (Eds.), ECommerce and Web Technologies. XI, 370 pages. 2004. Vol. 3181: Y. Kambayashi, M. Mohania, W. Wöß (Eds.), Data Warehousing and Knowledge Discovery. XIV, 412 pages. 2004. Vol. 3180: F. Galindo, M. Takizawa, R. Traunmüller (Eds.), Database and Expert Systems Applications. XXI, 972 pages. 2004. Vol. 3179: F.J. Perales, B.A. Draper (Eds.), Articulated Motion and Deformable Objects. XI, 270 pages. 2004. Vol. 3178: W. Jonker, M. Petkovic (Eds.), Secure Data Management. VIII, 219 pages. 2004.
Vol. 3162: R. Downey, M. Fellows, F. Dehne (Eds.), Parameterized and Exact Computation. X, 293 pages. 2004. Vol. 3160: S. Brewster, M. Dunlop (Eds.), Mobile HumanComputer Interaction – MobileHCI 2004. XVII, 541 pages. 2004. Vol. 3159: U. Visser, Intelligent Information Integration for the Semantic Web. XIV, 150 pages. 2004. (Subseries LNAI). Vol. 3158: I. Nikolaidis, M. Barbeau, E. Kranakis (Eds.), Ad-Hoc, Mobile, and Wireless Networks. IX, 344 pages. 2004. Vol. 3157: C. Zhang, H. W. Guesgen, W.K. Yeap (Eds.), PRICAI 2004: Trends in Artificial Intelligence. XX, 1023 pages. 2004. (Subseries LNAI). Vol. 3156: M. Joye, J.-J. Quisquater (Eds.), Cryptographic Hardware and Embedded Systems - CHES 2004. XIII, 455 pages. 2004. Vol. 3155: P. Funk, P.A. González Calero (Eds.), Advances in Case-Based Reasoning. XIII, 822 pages. 2004. (Subseries LNAI). Vol. 3154: R.L. Nord (Ed.), Software Product Lines. XIV, 334 pages. 2004. Vol. 3153: J. Fiala, V. Koubek, J. Kratochvíl (Eds.), Mathematical Foundations of Computer Science 2004. XIV, 902 pages. 2004. Vol. 3152: M. Franklin (Ed.), Advances in Cryptology – CRYPTO 2004. XI, 579 pages. 2004. Vol. 3150: G.-Z. Yang, T. Jiang (Eds.), Medical Imaging and Augmented Reality. XII, 378 pages. 2004. Vol. 3149: M. Danelutto, M. Vanneschi, D. Laforenza (Eds.), Euro-Par 2004 Parallel Processing. XXXIV, 1081 pages. 2004. Vol. 3148: R. Giacobazzi (Ed.), Static Analysis. XI, 393 pages. 2004.
Vol. 3177: Z.R. Yang, H. Yin, R. Everson (Eds.), Intelligent Data Engineering and Automated Learning – IDEAL 2004. XVIII, 852 pages. 2004.
Vol. 3147: H. Ehrig, W. Damm, J. Desel, M. Große-Rhode, W. Reif, E. Schnieder, E. Westkämper (Eds.), Integration of Software Specification Techniques for Applications in Engineering. X, 628 pages. 2004.
Vol. 3176: O. Bousquet, U. von Luxburg, G. Rätsch (Eds.), Advanced Lectures on Machine Learning. IX, 241 pages. 2004. (Subseries LNAI).
Vol. 3146: P. Érdi, A. Esposito, M. Marinaro, S. Scarpetta (Eds.), Computational Neuroscience: Cortical Dynamics. XI, 161 pages. 2004.
Vol. 3175: C.E. Rasmussen, H.H. Bülthoff, B. Schölkopf, M.A. Giese (Eds.), Pattern Recognition. XVIII, 581 pages. 2004.
Vol. 3144: M. Papatriantafilou, P. Hunel (Eds.), Principles of Distributed Systems. XI, 246 pages. 2004.
Vol. 3174: F. Yin, J. Wang, C. Guo (Eds.), Advances in Neural Networks - ISNN 2004. XXXV, 1021 pages. 2004. Vol. 3173: F. Yin, J. Wang, C. Guo (Eds.), Advances in Neural Networks– ISNN 2004. XXXV, 1041 pages. 2004. Vol. 3172: M. Dorigo, M. Birattari, C. Blum, L. M. Gambardella, F. Mondada, T. Stützle (Eds.), Ant Colony, Optimization and Swarm Intelligence. XII, 434 pages. 2004. Vol. 3170: P. Gardner, N. Yoshida (Eds.), CONCUR 2004 - Concurrency Theory. XIII, 529 pages. 2004. Vol. 3166: M. Rauterberg (Ed.), Entertainment Computing – ICEC 2004. XXIII, 617 pages. 2004. Vol. 3163: S. Marinai, A. Dengel (Eds.), Document Analysis Systems VI. XI, 564 pages. 2004.
Vol. 3143: W. Liu, Y. Shi, Q. Li (Eds.), Advances in WebBased Learning – ICWL 2004. XIV, 459 pages. 2004. Vol. 3142: J. Diaz, J. Karhumäki, A. Lepistö, D. Sannella (Eds.), Automata, Languages and Programming. XIX, 1253 pages. 2004. Vol. 3140: N. Koch, P. Fratemali, M. Wirsing (Eds.), Web Engineering. XXI, 623 pages. 2004. Vol. 3139: F. Iida, R. Pfeifer, L. Steels, Y. Kuniyoshi (Eds.), Embodied Artificial Intelligence. IX, 331 pages. 2004. (Subseries LNAI). Vol. 3138: A. Fred, T. Caelli, R.P.W. Duin, A. Campilho, D.d. Ridder (Eds.), Structural, Syntactic, and Statistical Pattern Recognition. XXII, 1168 pages. 2004. Vol. 3137: P. De Bra, W. Nejdl (Eds.), Adaptive Hypermedia and Adaptive Web-Based Systems. XIV, 442 pages. 2004.