EMPIRICAL EVALUATION METHODS IN COMPUTER VISION
Editors: Henrik I. Christensen & P. Jonathon Phillips
MACHINE PERCEPTION AND ARTIFICIAL INTELLIGENCE, Volume 50
World Scientific
EMPIRICAL EVALUATION METHODS IN COMPUTER VISION
SERIES IN MACHINE PERCEPTION AND ARTIFICIAL INTELLIGENCE* Editors:
H. Bunke (Univ. Bern, Switzerland) P. S. P. Wang (Northeastern Univ., USA)
Vol. 34: Advances in Handwriting Recognition (Ed. S.-W. Lee)
Vol. 35: Vision Interface — Real World Applications of Computer Vision (Eds. M. Cheriet and Y.-H. Yang)
Vol. 36: Wavelet Theory and Its Application to Pattern Recognition (Y. Y. Tang, L. H. Yang, J. Liu and H. Ma)
Vol. 37: Image Processing for the Food Industry (E. R. Davies)
Vol. 38: New Approaches to Fuzzy Modeling and Control — Design and Analysis (M. Margaliot and G. Langholz)
Vol. 39: Artificial Intelligence Techniques in Breast Cancer Diagnosis and Prognosis (Eds. A. Jain, A. Jain, S. Jain and L. Jain)
Vol. 40: Texture Analysis in Machine Vision (Ed. M. K. Pietikainen)
Vol. 41: Neuro-Fuzzy Pattern Recognition (Eds. H. Bunke and A. Kandel)
Vol. 42: Invariants for Pattern Recognition and Classification (Ed. M. A. Rodrigues)
Vol. 43: Agent Engineering (Eds. Jiming Liu, Ning Zhong, Yuan Y. Tang and Patrick S. P. Wang)
Vol. 44: Multispectral Image Processing and Pattern Recognition (Eds. J. Shen, P. S. P. Wang and T. Zhang)
Vol. 45: Hidden Markov Models: Applications in Computer Vision (Eds. H. Bunke and T. Caelli)
Vol. 46: Syntactic Pattern Recognition for Seismic Oil Exploration (K. Y. Huang)
Vol. 47: Hybrid Methods in Pattern Recognition (Eds. H. Bunke and A. Kandel)
Vol. 48: Multimodal Interface for Human-Machine Communications (Eds. P. C. Yuen, Y. Y. Tang and P. S. P. Wang)
Vol. 49: Neural Networks and Systolic Array Design (Eds. D. Zhang and S. K. Pal)
*For the complete list of titles in this series, please write to the Publisher.
Series in Machine Perception and Artificial Intelligence - Vol. 50

EMPIRICAL EVALUATION METHODS IN COMPUTER VISION

Editors

Henrik I. Christensen
Royal Institute of Technology, Stockholm, Sweden

P. Jonathon Phillips
National Institute of Standards and Technology, Gaithersburg, MD, USA

World Scientific
New Jersey • London • Singapore • Hong Kong
Published by World Scientific Publishing Co. Pte. Ltd. P O Box 128, Farrer Road, Singapore 912805 USA office: Suite IB, 1060 Main Street, River Edge, NJ 07661 UK office: 57 Shelton Street, Covent Garden, London WC2H 9HE
British Library Cataloguing-in-Publication Data A catalogue record for this book is available from the British Library.
EMPIRICAL EVALUATION METHODS IN COMPUTER VISION Copyright © 2002 by World Scientific Publishing Co. Pte. Ltd. All rights reserved. This book, or parts thereof, may not be reproduced in any form or by any means, electronic or mechanical, including photocopying, recording or any information storage and retrieval system now known or to be invented, without written permission from the Publisher.
For photocopying of material in this volume, please pay a copying fee through the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, USA. In this case permission to photocopy is not required from the publisher.
ISBN 981-02-4953-5
Printed in Singapore by Mainland Press
Foreword
For Computer Vision to mature from both scientific and industrial points of view, it is necessary to have methods and techniques for objective evaluation of computer vision algorithms. Towards this end, four workshops have been organised on this topic. The first workshop was on Performance Characterisation in Computer Vision and organised in association with ECCV-96 in Cambridge, U.K. The second was the First Workshop on Empirical Evaluation Methods in Computer Vision held in conjunction with CVPR 98 in Santa Barbara, California. The third workshop was on Performance Characterisation and organised in association with ICVS-99 in the Canary Islands. The fourth was the Second Workshop on Empirical Evaluation Methods in Computer Vision held on 1 July 2000 in conjunction with ECCV 2000 in Dublin, Ireland.

The primary goal of these workshops was to give researchers in the computer vision community a venue for presenting papers and discussing methods in evaluation. The secondary goals were to discuss strategies for gaining acceptance of evaluation methods and techniques in the computer vision community and to discuss approaches for facilitating long-term progress in evaluation. This volume contains revised papers from the Second Workshop on Empirical Evaluation Methods in Computer Vision and two additional papers considered essential to characterising the state of the art of empirical evaluation in 2001. We were honoured that Prof. Bowyer and Prof. Förstner accepted our offer to give invited presentations. This volume includes a paper by Prof. Bowyer that summarises his presentation at the workshop.

We are most grateful for the support and assistance we have received from the organisation committee. In particular we would like to thank Patrick Courtney, Adrian Clark and David Vernon for their assistance in organising the workshop. In addition, we are grateful for the assistance offered
by World Scientific Press for the preparation of this volume. The support of Yolande Koh and Alan Pui has been particularly instrumental for the completion of this volume. This book was partially supported by the European Commission under the PCCV Contract (IST-1999-14159). This support is gratefully acknowledged.
Stockholm and Gaithersburg, MD, September 2001 H. I. Christensen & P. J. Phillips
Contents

Foreword
Chapter 1. Automated Performance Evaluation of Range Image Segmentation Algorithms
1.1. Introduction
1.2. Scoring the Segmented Regions
1.3. Segmentation Performance Curves
1.4. Training of Algorithm Parameters
1.5. Train-and-Test Performance Evaluation
1.6. Training Stage
1.7. Testing Stage
1.8. Summary and Discussion
References
Chapter 2. Training/Test Data Partitioning for Empirical Performance Evaluation
2.1. Introduction
2.2. Formal Problem Definition
2.2.1. Distance Function
2.2.2. Computational Complexity
2.3. Genetic Search Algorithm
2.4. A Testbed
2.5. Experimental Results
2.6. Conclusions
References
Chapter 3. Analyzing PCA-based Face Recognition Algorithms: Eigenvector Selection and Distance Measures
3.1. Introduction
3.2. The FERET Database
3.3. Distance Measures
3.3.1. Adding Distance Measures
3.3.2. Distance Measure Aggregation
3.3.3. Correlating Distance Metrics
3.3.4. When Is a Difference Significant
3.4. Selecting Eigenvectors
3.4.1. Removing the Last Eigenvectors
3.4.2. Removing the First Eigenvector
3.4.3. Eigenvalue Ordered by Like-Image Difference
3.4.4. Variation Associated with Different Test/Training Sets
3.5. Conclusion
References
Chapter 4. Design of a Visual System for Detecting Natural Events by the Use of an Independent Visual Estimate: A Human Fall Detector
4.1. Introduction
4.2. Approach
4.3. Data Collection
4.4. Velocity Estimation
4.4.1. Colour Segmentation and Velocity Estimation
4.4.2. IR Velocity Estimation
4.4.3. Velocity Correlation
4.4.4. Data Combination
4.4.5. Conclusions
4.5. Neural Network Fall Detector
4.5.1. Data Preparation, Network Design, and Training
4.5.2. Testing
4.6. Conclusions
References
Chapter 5. Task-Based Evaluation of Image Filtering within a Class of Geometry-Driven-Diffusion Algorithms
5.1. Introduction
5.2. Nonlinear Geometry-Driven Diffusion Methods of Image Filtering
5.3. Diffusion-Like Ideal Filtering of a Noise-Corrupted Piecewise Constant Image Phantom
5.4. Stochastic Model of the Piecewise Constant Image Phantom Corrupted by Gaussian Noise
5.5. Estimates of Probability Distribution Parameters for Characterization of Filtering Results
5.6. Implementation Results
5.7. Conclusions
References
Chapter 6. A Comparative Analysis of Cross-Correlation Matching Algorithms Using a Pyramidal Resolution Approach
6.1. Introduction
6.2. Area-Based Matching Algorithms
6.3. Cross-Correlation Algorithms
6.4. Pyramidal Processing Scheme
6.4.1. Number of Layers
6.4.2. Decimation Function
6.4.3. Matching Process
6.4.4. Interpolation
6.4.5. Disparity Maps
6.5. Experimental Results
6.5.1. Experiment Layout
6.5.2. Disparity Maps
6.5.3. Disparity Error
6.5.4. Computational Load
6.6. Conclusion
References
Chapter 7. Performance Evaluation of Medical Image Processing Algorithms
7.1. Introduction
7.2. Presentations
7.2.1. New NCI Initiatives in Computer-Aided Diagnosis
7.2.2. Performance Characterization of Image and Video Analysis Systems at Siemens Corporate Research
7.2.3. Validating Registration Algorithms: A Case Study
7.2.4. Performance Evaluation of Image Processing Algorithms in Medicine: A Clinical Perspective
7.2.5. Performance Evaluation: Points for Discussion
7.3. Panel Discussion
References
CHAPTER 1
Automated Performance Evaluation of Range Image Segmentation Algorithms
Jaesik Min
Computer Science & Engineering, University of South Florida, Tampa, FL 33620-5399, USA
E-mail: [email protected]

Mark Powell
Jet Propulsion Laboratory, Pasadena, CA, USA
E-mail: [email protected]

Kevin Bowyer
Computer Science & Engineering, University of South Florida, Tampa, FL 33620-5399, USA
E-mail: [email protected]
We describe a framework for evaluating the performance of range image segmentation algorithms. The framework is intended to be fully automated and to allow objective and relevant comparison of performance. In principle, it could be used to evaluate general image segmentation algorithms, but the framework is demonstrated here using range images. The framework is implemented in a publicly available tar file that includes images, code, and shell scripts. The primary performance metric is the number of regions correctly segmented. The definition of "correctly segmented" is parameterized on the percent of mutual overlap between a segmented region in an image and its corresponding region in a ground truth specification of the image. This work should make it possible to directly compare the performance of range image segmentation algorithms intended either for planar-surface scenes or for curved-surface scenes.
1.1. Introduction

Performance evaluation of computer vision algorithms has received increasing attention in recent years.1,3,4 Region segmentation, like edge detection, is regarded as a fundamental low-level algorithm for image analysis. This paper presents an automated framework for objective performance evaluation of region segmentation algorithms. While the framework is presented in the context of segmentation of range images, it should be applicable to any type of imagery for which ground truth can be reliably specified. Range images seem to be a good initial context for this work because it is relatively unambiguous to specify the surface patches in a range image.

The work reported here is an extension of our previous work.5 Our initial work in performance evaluation of range segmentation compared four algorithms.6 This study was limited to algorithms that segment images into planar regions. Also, the training to select parameter values for the segmenters was done manually by the algorithm developers. This work was extended to include algorithms that segment range images into curved-surface patches,8 to evaluate additional planar-surface algorithms,9 and to use an automated method of training to select algorithm parameters.13

Here we are interested in demonstrating a complete, automated framework for objective performance evaluation of (range image) region segmentation algorithms. The framework includes image data sets with manually-specified ground truth, automated scoring of performance metrics, automated training to select algorithm parameters, baseline algorithms for performance comparison, and a test for statistical significance of observed performance differences.

1.2. Scoring the Segmented Regions

Our definitions of the performance metrics for region segmentation are based on the definition for region segmentation that is given in similar form in many textbooks.16,17,18,19 The particular version of the definition that we use is summarized as follows. A segmentation of an image $R$ into regions $r_1, \ldots, r_n$ is defined by the following properties:

(1) $r_1 \cup r_2 \cup \ldots \cup r_n = R$. (Every pixel belongs to a region.)
(2) Every region is spatially connected. Our implementation currently uses 4-connectedness as the definition of spatially connected.
(3) $\forall r_i, r_j \in R$, $i \neq j$: $r_i \cap r_j = \emptyset$. (All regions are disjoint.)
(4) $\forall r_i \in R$: $P(r_i) = \text{true}$. (All pixels in a region satisfy a specified similarity predicate; in the case of range images, they belong to the same scene surface.)
(5) $\forall r_i, r_j \in R$, $i \neq j$, with $r_i, r_j$ four-connected and adjacent: $P(r_i \cup r_j) = \text{false}$. (If two regions are four-connected and adjacent then they represent different surfaces.)
(6) There are "artifact regions" in the image where no valid measurement was possible, which all have the same label (violating rule 2) and for which rules 4 and 5 do not apply. In the ground truth, these essentially are scanner artifacts that the region segmentation algorithm is not expected to handle correctly as a normal region.

For each image used in the train or test sets, we manually specify a "ground truth" segmentation of the image conforming to this definition. This is done with the aid of an interactive tool developed for this purpose. Figure 1.1 shows an example range image acquired with the Cyberware range scanner, and the corresponding ground truth specification.

Fig. 1.1. Example range image (a) and corresponding manually specified ground truth image (b). The image has conical and cylindrical foreground surfaces, and planar background surfaces. The ground truth contains regions for these surfaces, plus "artifact regions" for the pixels that correspond to significant artifacts in the image, in this case "shadow regions."
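The connectivity requirement in property (2) is easy to check mechanically. The following is a minimal sketch, not part of the distributed framework (whose tools are written in C), that verifies 4-connectedness of one labeled region by flood fill:

```python
from collections import deque
import numpy as np

def region_is_connected(labels, region_id):
    """Check property (2): all pixels with this label form one 4-connected component."""
    ys, xs = np.nonzero(labels == region_id)
    pixels = set(zip(ys.tolist(), xs.tolist()))
    if not pixels:
        return False
    start = next(iter(pixels))
    seen, queue = {start}, deque([start])
    while queue:
        y, x = queue.popleft()
        for n in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):  # 4-neighbours
            if n in pixels and n not in seen:
                seen.add(n)
                queue.append(n)
    return len(seen) == len(pixels)  # every labeled pixel was reachable
```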
A machine segmentation (MS) of an image can be compared to the ground truth (GT) specification for that image to count instances of correct segmentation, under-segmentation, over-segmentation, missed, and noise. We use the same definitions of these performance metrics as used in previous work.6 The definitions are based on the degree of mutual overlap required between a region in the MS and a corresponding region in the GT. For example, an instance of "correct segmentation" is recorded iff an MS region and its corresponding GT region have greater than the required threshold of mutual overlap. Multiple MS regions that correspond to one GT region give rise to an instance of over-segmentation. One MS region that corresponds to several GT regions gives rise to an instance of under-segmentation. A GT region that has no corresponding MS region gives rise to an instance of a missed region. An MS region that has no corresponding GT region gives rise to an instance of a noise region. Figure 1.2 illustrates the definitions of the performance metrics.

Fig. 1.2. Illustration of definitions for scoring region segmentation results. MS region A corresponds to GT region 1 as an instance of a correct segmentation; GT region 5 corresponds to MS regions C, D, and E as an over-segmentation; MS region B corresponds to GT regions 2, 3, and 4 as an under-segmentation; GT region 6 is an instance of a missed region; MS region F is an instance of noise.

A comparison tool has been implemented to compare an MS result and its corresponding GT at a specified overlap threshold and automatically score the results. This tool, along with several range image datasets, is available on our lab web pages at http://marathon.csee.usf.edu/seg-comp/SegComp.html.
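To make the scoring rule concrete, the sketch below classifies regions by mutual overlap in the spirit of the definitions above. It is a simplification, not the distributed compare tool: only the correct, missed, and noise cases are handled, and the one-to-many tests for over- and under-segmentation are omitted.

```python
import numpy as np

def score_segmentation(ms, gt, T=0.8):
    """Simplified scoring of a machine segmentation against ground truth.

    ms, gt -- 2-D integer label images; label 0 is assumed to mark
              unlabeled/artifact pixels (an assumption of this sketch).
    T      -- required mutual-overlap threshold, 0.5 < T <= 1.0.
    """
    ms_ids = [i for i in np.unique(ms) if i != 0]
    gt_ids = [j for j in np.unique(gt) if j != 0]
    correct, ms_matched, gt_matched = 0, set(), set()
    for i in ms_ids:
        m = (ms == i)
        for j in gt_ids:
            g = (gt == j)
            inter = np.logical_and(m, g).sum()
            # Mutual overlap: the intersection must cover at least T of BOTH regions.
            if inter >= T * m.sum() and inter >= T * g.sum():
                correct += 1
                ms_matched.add(i)
                gt_matched.add(j)
    missed = sum(1 for j in gt_ids if j not in gt_matched)  # simplified definition
    noise = sum(1 for i in ms_ids if i not in ms_matched)   # simplified definition
    return correct, missed, noise
```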
1.3. Segmentation Performance Curves

The meaningful range of required percent overlap between a given MS result and its corresponding GT image is 50% < T < 100%. As the overlap threshold is varied from lower (less strict) to higher (more strict) values, the number of instances of correct segmentation will decrease. See Figure 1.3 for an illustration. At the same time, the number of instances of the different errors will generally increase.

Fig. 1.3. Example segmentation performance curve. The plot shows the number of instances of correctly segmented regions against the threshold for overlap of a region with GT (from 0.51 to 0.90), with the level of perfect performance and the typical falling pattern of performance indicated.

A performance curve can be created for each individual metric (correct, under-segmentation, ...) for each image
in a data set. The performance curve shows how the number of instances of the given metric changes for the given image as the overlap threshold varies over its range. Also, an average performance curve can be created for an image data set as a whole. If algorithm A has consistently better performance than algorithm B, then its performance curve for the correct-detections metric will lie above that of algorithm B. This comparison can be given a quantitative basis using the "area under the curve." Performance curves can be normalized to a basis where the ideal curve has an area of 1.0. Thus the "Area Under the performance Curve" (AUC) becomes an index in the range [0.0, 1.0], representing the average performance of an algorithm over a range of values for the overlap threshold.

The performance curve and the AUC metric as used here have a superficial similarity to the receiver operating characteristic (ROC) curve and the area under the ROC curve.2 However, the performance curve as used here is not legitimately an instance of an ROC curve, or of a free-response ROC curve (FROC). The image segmentation problem as defined here is to specify a complete decomposition of the image into a set of regions. This is different from the problem definition that gives rise to an ROC or FROC curve.

It is of course possible that the AUC index will obscure situations where, for example, algorithm A is better than algorithm B for low values of the overlap threshold, but worse than B at high values. Thus, in comparing two algorithms, it is important to also consider whether the performance curves cross each other. For experiments reported in this paper, the reported AUC values are computed using a trapezoid rule with the overlap threshold sampled between 0.51 and 0.95. Our general experience is that the performance of current range segmentation algorithms drops rapidly with an overlap threshold any stricter than 0.8, and so there is little value in sampling beyond 0.95. Another way of saying this is that algorithms can often easily segment out 80% of the pixels belonging to a given region, but then have increasing difficulty segmenting a larger percentage of the total region.
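Under the stated sampling convention, the AUC reduces to a single trapezoid-rule integral. A minimal sketch; the normalization by the ideal curve is our reading of the description above:

```python
import numpy as np

def auc_correct(thresholds, correct_counts, n_gt_regions):
    """Normalized area under the correct-segmentation performance curve.

    thresholds     -- overlap thresholds, e.g. np.arange(0.51, 0.96, 0.01)
    correct_counts -- number of correctly segmented regions at each threshold
    n_gt_regions   -- number of ground-truth regions (height of the ideal curve)
    """
    ideal_area = n_gt_regions * (thresholds[-1] - thresholds[0])
    return np.trapz(correct_counts, thresholds) / ideal_area
```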
1.4. Training of Algorithm Parameters

Manual training of algorithm parameters means that results will vary according to the effort and skill of the experimenter. For objective performance evaluation, it is important that the results be reproducible by other researchers. This is why we developed an automated procedure for parameter training. An important question is whether automated training can produce performance as good as manual training.

An example comparison of performance curves obtained by manual and automated parameter training appears in Figure 1.4. These curves were created using Jiang and Bunke's algorithm that segments range images into planar regions.12 The manually selected parameter values for this comparison are the same as those used in Hoover et al.6 Automated training13 of the algorithm parameters was done with the same set of training images as in the study by Hoover et al. Performance curves are plotted for the algorithm trained manually versus trained automatically, for each of two different data sets. The two data sets represent different types of range image acquisition technology: the ABW range scanner works on a structured-light principle, and the Perceptron scanner works on the time-of-flight principle.

Fig. 1.4. Manual versus automated training to select parameter values (UB segmenter trained for 4 parameters; manual experiment of Hoover et al. versus automated experiment).

The performance curves for the ABW data set lie above those for the Perceptron data set. However, different scenes were imaged with the two scanners, and so the data does not support any conclusion about the relative quality of the scanners. The important point is that the differences between the performance curves for manual versus automated training are relatively small, with manual training having a slight advantage on one data set and automated training having a slight advantage on the other. Also important is the fact that the differences in performance here due to manual versus automated training are small in comparison to the differences between segmenters as observed in Hoover et al.6 This indicates that automated training can provide results comparable to those from manual training by the developer of the algorithm.

The UB planar-surface segmenter has 7 input parameters, as shown in Table 1.1. The first two parameters are the most critical. The table lists the values of the seven parameters found by the manual training,6 and the values of the first four parameters as found by our automated training method. The parameter values selected by automated training are generally close to those selected by manual training, but not identical. However, this appears to be just an instance of the general phenomenon of different sets of parameter combinations resulting in similar performance on the test set.
Table 1.1. Manual and automated training results for the UB planar-surface segmenter. The implementation has a total of seven parameters that are thresholds on various values. For this experiment, the automated training was done on the four most important parameters, and the three less important parameters were fixed at the same values found by the earlier manual training.

          ABW                    Perceptron
Para.     manual   automated    manual   automated
tau_1     1.25     1.20         2.75     1.75
tau_2     2.25     2.90         3.50     3.25
t_1       4.0      4.50         4.0      1.70
t_2       0.1      0.75         0.125    0.1
t_3       3.0      3.0          3.0      3.0
t_4       0.1      0.1          0.2      0.2
t_5       100      100          150      150
Segmenters may vary in the number of parameters provided for performance tuning. The number of parameters trained is a major factor in the effort required in the training process. For example, training four parameters of the Jiang and Bunke algorithm12 on ten 10-image training sets (256×256 images from the Cyberware scanner) takes about two days as a background process on a Sun Ultra 5 workstation.

The automated training procedure currently used is a form of adaptive search. It operates as follows. Assume that the number of parameters to be trained, and the plausible range of each parameter, are specified. The plausible range of each parameter is sampled by 5 evenly-spaced points. If D parameters are trained, then there are 5^D initial parameter settings to be considered. The segmenter is run on each of the training images with each of these 5^D parameter settings. The segmentation results are evaluated against the ground truth using the automated comparison tool. Performance curves are constructed for the number of instances of correct region segmentation, and the areas under the curves are computed.

The highest-performing one percent of the 5^D initial parameter settings, as ranked by area under the performance curve on the training set of images, are selected for refinement in the next iteration (e.g., the top six settings carried forward in training four parameters). The refinement in the next iteration creates a 3 × 3 × ... × 3 sampling around each of the parameter settings carried forward. See Figure 1.5 for an illustration. In this way, the resolution of the parameter settings becomes finer with each iteration, even as the total number of parameter settings considered is reduced on each iteration.
Fig. 1.5. Illustration of the adaptive sampling of parameter space in the training process: initial sampling of the parameter space with the best points circled (left), and the refinement of the next iteration (right).
The expanded set of points is then evaluated on the training set, and the areas under the performance curves are again computed. The top-performing points are again selected to be carried forward to the next iteration. Iteration continues until the improvement in the area under the performance curve drops below 5% between iterations. Then the current top-performing point is selected as the trained parameter setting, as sketched below.
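The sketch assumes a fixed iteration count in place of the 5% stopping rule, and a hypothetical `evaluate` callback that returns the training-set AUC for a parameter setting:

```python
import itertools
import numpy as np

def adaptive_search(ranges, evaluate, iters=4, init=5, keep_frac=0.01):
    """Adaptive grid-refinement search over D parameter ranges.

    ranges   -- list of (low, high) plausible ranges, one per parameter
    evaluate -- hypothetical callback mapping a parameter tuple to training AUC
    """
    axes = [np.linspace(lo, hi, init) for lo, hi in ranges]
    points = list(itertools.product(*axes))          # the 5**D initial settings
    steps = [(hi - lo) / (init - 1) for lo, hi in ranges]
    for _ in range(iters):
        ranked = sorted(points, key=evaluate, reverse=True)
        survivors = ranked[:max(1, int(len(ranked) * keep_frac))]  # top one percent
        steps = [s / 2 for s in steps]   # halving the step: one plausible schedule
        points = []
        for p in survivors:              # 3 x 3 x ... x 3 sampling around survivors
            local = [(v - s, v, v + s) for v, s in zip(p, steps)]
            points.extend(itertools.product(*local))  # duplicates not pruned here
    return max(points, key=evaluate)
```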
A more complex algorithm for searching the parameter space has been considered by Cinque et al.21 They explored an approach based on genetic algorithms. They suggest that the training of the Jiang and Bunke algorithm for curved-surface images is relatively sensitive to the composition of the training set. They do not report details of training execution times and composition of training sets, so we cannot make a direct comparison of training approaches on these points.

1.5. Train-and-Test Performance Evaluation

It is important to use a "train and test" methodology in evaluating computer vision algorithms. For instance, in an evaluation of edge detection algorithms using the results of a structure-from-motion task, it was found that the ranking of different algorithms changed from the training set results to the test results.15 Therefore, in our framework, the parameter settings that result from the automated training process are then evaluated using a separate pool of test images.
The particular composition of the training set of images can affect the trained parameter values, and so also affect the observed test performance. Therefore, it is useful to have multiple separate training sets. In our current implementation, individual training sets are drawn from a pool of twenty training images. The training pool of 20 images of planar-surface scenes is shown in Figure 1.6. These images were taken with the ABW scanner in the computer vision lab at the University of Bern. The training pool of 20 images of curved-surface scenes is shown in Figure 1.7. These images were taken with the Cyberware scanner in the computer vision lab at the University of South Florida.

Fig. 1.6. Pool of twenty training images of planar-surface scenes (ABW scanner).

Fig. 1.7. Pool of twenty training images of curved-surface scenes (Cyberware scanner).

In our current implementation, an individual training set consists of ten images. We create each training set by randomly selecting, without replacement, ten images from the twenty-image training pool. We create ten different training sets in this way. Obviously, there is in general some overlap in the images contained in two different training sets. However, any two training sets will have on average only about 50% overlap for ten images drawn from a twenty-image pool.
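Drawing the training sets is a one-liner; the roughly 50% expected overlap follows from the hypergeometric mean of 10 × 10/20 = 5 shared images. A small sketch:

```python
import random

pool = list(range(20))   # indices into the twenty-image training pool
training_sets = [sorted(random.sample(pool, 10)) for _ in range(10)]

# Two independently drawn ten-image sets share 5 images on average,
# i.e. about 50% of a ten-image training set.
a, b = set(training_sets[0]), set(training_sets[1])
print(len(a & b))
```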
The number of images in the training pool, the number of training sets, and the number of images in a training set can all be increased. Increasing the size of the training pool requires additional experimental work in acquiring the images and manually specifying the ground truth. Increasing the number of training sets or the size of the individual training sets translates into increasing the compute time required to train a segmenter. Possible motivations for increasing any of these parameters of the evaluation framework would include (a) making it possible to reliably measure smaller differences between algorithms, and (b) making the training set more representative of a broader range of scene content.

In addition to the pool of twenty training images, we have a separate pool of twenty test images. From the pool of test images, we draw ten different ten-image test sets. The set of parameters resulting from each training set is evaluated using each test set. Thus we have a total of 100 test results, with
each result being a set of performance curves and the corresponding areas under the curves. To allow separate evaluation of segmenters intended for planar-surface scenes or for curved-surface scenes, we have two separate sets of training and test sets. The shapes in the curved-surface scenes are all formed of quadric surface patches, and the ground truth segmentation is in terms of these quadric surface patches. (Due to various difficulties encountered in using the K2T structured-light scanner,8 we dropped the use of this particular scanner in this project.)

1.6. Training Stage

The implementation of the performance evaluation framework is available from our lab web site. It can be downloaded in the form of a compressed UNIX tar file. The tar file contains images of planar-surface and curved-surface scenes, corresponding ground truth overlays, UNIX scripts, and C source code. There is source code for "baseline" algorithms for planar-surface and curved-surface range image segmentation.11,12 There is also source code for the "compare" routine that matches a segmentation result to a ground truth overlay and computes the number of instances of correct segmentation, over-segmentation, under-segmentation, missed and noise regions. The framework should be able to be installed on a variety of UNIX systems with relatively minimal effort.

There are two stages to using the framework: training and testing. The first step in the training stage is to reproduce the known training results for the baseline comparison algorithm. Assuming that the installation is already done, this can be done by running a script to train the baseline algorithm on each of the ten ten-image training sets. As an example, assume that we are interested in algorithms for segmentation of curved-surface scenes. The visible results of running the scripts to train the baseline planar-surface and curved-surface algorithms are the sets of ten performance curves shown in Figure 1.8 and Figure 1.9, respectively. For each curve, there is a corresponding area under the curve, and a corresponding set of trained parameter settings used to create the curve. The ten AUCs for the training of the baseline curved-surface segmenter are listed in Table 1.2.

The correct results of training the given baseline algorithm are of course already known, as illustrated in Figure 1.9 and Table 1.2. Comparing the results of the training as executed on the user's system to the known correct
Fig. 1.8. Performance curves of the UB planar-surface algorithm on the 10 training sets (4 parameters tuned; correct detections versus compare-tool tolerance). Each curve represents segmenter performance on a different set of ten images, with the segmenter parameter values trained for best performance on that set of ten images. The curves span a range of performance levels, indicating that the different training sets span a range of levels of difficulty.
Table 1.2. Area under the ten baseline algorithm training curves. These are the values for the area under the performance curve for correct region segmentation for the UB planar-surface segmenter as automatically trained on each of ten different ten-image training sets, as described in the text. The area under the curve is normalized to be between zero and one. The overlap threshold was sampled from 0.51 to 0.95.

train set    1    2    3    4    5    6    7    8    9   10
AUC value  .41  .35  .36  .28  .22  .47  .44  .39  .34  .28
Fig. 1.9. Performance curves of the UB curved-surface algorithm on the 10 training sets (4 parameters tuned). Each curve represents segmenter performance on a different set of ten images, with the segmenter parameter values trained for best performance on that set of ten images. The curves span a range of performance levels, indicating that the different training sets span a range of levels of difficulty.
results serves to verify that the framework is installed correctly and producing comparable results.

Once the user has verified that the framework has been installed correctly, the next step is to train the "challenger" segmentation algorithm. Conceptually, this is done simply by replacing the source code for the baseline algorithm with that for the "challenger" algorithm and running the same training script. The result is a set of training performance curves similar to those in Figure 1.9, with their corresponding AUCs.

At this point, the user should carefully consider the training performance curves for the challenger algorithm as compared to those of the baseline algorithm. A candidate new segmentation algorithm can be said to be (apparently) performing well to the extent that its training results improve on those of the baseline segmenter. If the challenger algorithm does not outperform the baseline algorithm by some clear increment on the training sets, then there is no point in proceeding to the test stage. The training process can be repeated as many times as desired in the development of the new segmentation algorithm. The challenger algorithm might be enhanced and re-trained until it out-performs the baseline algorithm on the training sets. There is also the possibility that a particular algorithmic approach might be abandoned at this stage.

1.7. Testing Stage

A challenger algorithm that out-performs the baseline algorithm by some increment on the training sets can then be compared more rigorously on the test sets. There are ten ten-image test sets. The parameter settings from each of the ten training sets are run on each of the ten test sets. This gives 100 test performance curves and corresponding areas under the curves. The ten performance curves for the baseline curved-surface algorithm, as trained on a particular one of the ten training sets and then tested on each of the ten test sets, are shown in Figure 1.10. The areas under the performance curves for the ten test sets are given in Table 1.3. Note that the absolute performance of the baseline segmenter is only "moderate" on this test set. If a proposed new algorithm truly offers a substantial increment in improved performance, it should result in observable differences in this performance metric.
Table 1.3. Area under the ten test curves for a given training result. These are the values for the area under the performance curve for correct region segmentation for the UB planar-surface segmenter as automatically trained on training set number 1, and then tested on each of ten different ten-image test sets.

test set     1    2    3    4    5    6    7    8    9   10
AUC value  .38  .38  .38  .38  .39  .37  .37  .35  .38  .38
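The 100 test results arise from pairing each trained parameter setting with each test set. A sketch of this driver loop, where `run_segmenter_auc` is a hypothetical callback wrapping the segmenter and the compare tool:

```python
import numpy as np

def evaluate_all(trained_params, test_sets, run_segmenter_auc):
    """Pair each of ten trained settings with each of ten test sets (10 x 10 grid)."""
    aucs = np.empty((len(trained_params), len(test_sets)))
    for i, params in enumerate(trained_params):
        for j, test in enumerate(test_sets):
            aucs[i, j] = run_segmenter_auc(params, test)
    return aucs   # aucs.ravel() yields the 100 paired values used below
```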
Fig. 1.10. Performance curves of the UB curved-surface algorithm on the ten test sets. The curves on this plot represent performance using the one set of parameters selected based on a particular training set, and then evaluated on the ten different test sets. The curves are not all visually distinct because the parameters may have resulted in very similar performance on some test sets. The results here span a smaller range of performance than the training results, as would be expected based on the same parameters being used with each test set.

Note that the challenger test results are "paired" with corresponding results for the baseline algorithm, according to training and testing on the same image sets. This becomes important in the comparison of algorithms. Performance of two algorithms can be compared at a qualitative level by
comparing corresponding performance curves, or at a quantitative level by comparing the areas under the performance curves. Both are important. As mentioned earlier, comparing the corresponding performance curves may indicate that relative algorithm performance is not consistent across the full range of overlap threshold values. Assuming that comparison of the ten pairs of corresponding performance curves shows no consistent pattern of intersection, the question is then whether an observed average improvement in the area under the curve is statistically significant. Two levels of statistical test are possible.

A simple sign test can be used to check for statistical significance with-
out requiring the assumption that the differences between the two segmenters follow a normal distribution.20 The null hypothesis is that there is no true difference in average performance of the baseline algorithm and a proposed new algorithm. There are one hundred (challenger − baseline) AUC differences. The number of tests on which the challenger shows better performance should follow a binomial distribution. Under the null hypothesis, the challenger would be expected to show better performance than the baseline on fifty of the 100 tests, and the standard deviation would be five. Thus any result outside the range of forty to sixty (plus/minus two standard deviations from the mean) would provide evidence, at the α = 0.05 level, of a statistically significant difference in performance.

A more powerful statistical test that can be used is the paired-t test.20 Let $A_i$, $i = 1, \ldots, 100$, be the area under the test performance curve for the challenger algorithm. Similarly, let $B_i$, $i = 1, \ldots, 100$, be the area under the test performance curve for the baseline algorithm. The value $D_i = A_i - B_i$ is then the difference in the area under the test performance curve, controlled (paired) for training and test sets. The null hypothesis is that there is no true difference in the average value of the performance metric between the two algorithms. The null hypothesis can be rejected if the average of the $D_i$ values is sufficiently far from zero. The test statistic is the mean of the $D_i$ divided by the standard deviation of the $D_i$. This can be compared against limits of the t distribution for nine degrees of freedom and a chosen level of confidence (α = 0.05). Estimates of the "power" of the paired-t test20 indicate that using ten training sets should allow reliable detection of increments in performance improvement as small as 10%.

As an example of comparing two segmentation algorithms, we step through a comparison of the "YAR" algorithm7 developed at the University of South Florida to the Jiang and Bunke ("UB") algorithm for segmenting planar-surface scenes. Figure 1.12 plots the one hundred values for the test AUC for each of the two algorithms. This plot gives a clear impression that the UB algorithm performs better than the YAR algorithm. In general, a statistical test might still be needed to determine whether the result is significant. However, in this particular case, the UB algorithm had at least a slightly higher AUC in every one of the one hundred paired values. A plot of the distribution of the increment of the UB algorithm's improvement in AUC over the YAR algorithm is given in Figure 1.13. This plot shows that the difference in the paired values does not appear to be normally (Gaussian) distributed.
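Both tests reduce to a few lines given the 100 paired AUC values. A sketch; the t statistic uses the standard paired form with the standard error, a slight elaboration of the wording above:

```python
import numpy as np

def compare_algorithms(auc_challenger, auc_baseline):
    """Sign test and paired-t statistic over paired AUC values."""
    d = np.asarray(auc_challenger) - np.asarray(auc_baseline)
    n = len(d)
    wins = int((d > 0).sum())
    # Sign test: under H0, wins ~ Binomial(n, 0.5); mean n/2, sd sqrt(n)/2 (= 5 for n = 100).
    mean, sd = n / 2.0, np.sqrt(n) / 2.0
    sign_significant = abs(wins - mean) > 2.0 * sd   # roughly the alpha = 0.05 level
    # Paired-t statistic: mean difference over its standard error.
    t = d.mean() / (d.std(ddof=1) / np.sqrt(n))
    return wins, sign_significant, t
```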
Fig. 1.11. Use of image training sets and test sets for algorithm evaluation. Ten training sets (#1 through #10) are drawn from the pool of twenty training images; the parameters from each training set are paired with each of ten test sets drawn from the pool of twenty test images. Evaluation of an algorithm thus yields one hundred values for area under the performance curve. These can be paired against corresponding values for a different algorithm for a statistical test for significance in algorithm performance.
Therefore the paired-t test would not be appropriate in this case. However, the simpler sign test is still appropriate. And, in this case, with the UB algorithm winning on all one hundred trials, it clearly indicates that the UB algorithm offers a statistically significant improvement in performance.

Fig. 1.12. Test AUC values for the UB and YAR planar-surface segmentation algorithms on the ABW test sets. Note that the cloud of points for the UB algorithm has a generally higher AUC value than that for the YAR algorithm. In fact, for every one of the one hundred paired AUC values, the UB algorithm has at least a slightly higher AUC value.

Fig. 1.13. Distribution of improvement in AUC of the UB algorithm over the YAR algorithm (d = UBP − YAR). Note that the distribution does not quite appear Gaussian, as it has a long "tail" to the higher values.

1.8. Summary and Discussion

We have described an automated framework for objective performance evaluation of region segmentation algorithms. Our current implementation of this framework is focused on range images, but the framework can be used with other types of images. The framework requires that ground truth region segmentation be manually specified for a body of images. This body of images is divided into a training pool and a test pool. Multiple different training sets are randomly drawn from the training pool. The parameters of the segmentation algorithm are automatically trained on each training set. The different training results are then evaluated on separate test sets. The result is a set of performance curves, one for each combination of training set and test set. The area under the curve is used as an index in a test for statistically significant differences in performance.

It is important that the conceptual framework is readily extendible in
several directions. First, as mentioned, it is applicable to other types of imagery. For example, the data set of texture images used in ref.14 might be used in the framework to evaluate texture-based segmentation algorithms. Also, the size and number of training sets can be increased. The current implementation should be more than sufficient for recognizing coarse-grain performance improvements, appropriate to the current state of the art in range image segmentation as demonstrated in Figure 1.10 and by Hoover et al.6 However, as the general performance level of range image segmentation algorithms improves, it may become necessary to recognize finer-grain improvements. In this case, the number of images, training sets, and test sets could be expanded in order to allow reliable observation of smaller
differences in performance. While we have focused on the performance metric for instances of correct segmentation, it is possible to also look at secondary error metrics such as over-segmentation, under-segmentation, missed and noise. This can be important, for example, in applications where the cost of the different types of errors varies significantly.

Lastly, we should emphasize that there is both a short-term and a longer-term purpose to this work. In the short term, it provides a means to rank the performance of existing algorithms. However, the longer-term and more important purpose is to enable the design of better algorithms. By being able to measure the frequency of different types of errors in segmentation, it should be possible to design new algorithms that address the failure modes of previous algorithms.
Acknowledgments

This work was supported by National Science Foundation grants IIS-9731821 and EIA-9729904.
References

1. H. Christensen and W. Foerstner, Machine Vision and Applications 9, 5 (1997).
2. K. W. Bowyer, "Validation of medical image analysis techniques," in Handbook of Medical Imaging: Volume 2 - Medical Image Processing and Analysis, J. M. Fitzpatrick and M. Sonka, editors (SPIE Press, 2000), 567-607.
3. K. W. Bowyer and P. J. Phillips, eds., Empirical Evaluation Techniques in Computer Vision (IEEE Computer Society Press, California, 1998).
4. M. A. Viergever et al., eds., Performance Characterization and Evaluation of Computer Vision Algorithms (Kluwer Academic Publishers, 2000).
5. J. Min et al., in Workshop on Applications of Computer Vision (IEEE Computer Society Press, California, 2000), 163-168.
6. A. W. Hoover et al., IEEE PAMI 18, 7 (1996), 673-689.
7. A. W. Hoover et al., IEEE PAMI 29, 12 (1998), 1352-1357.
8. M. W. Powell et al., in International Conference on Computer Vision (IEEE Computer Society Press, California, 1998), 286-291.
9. X. Jiang et al., in International Conference on Pattern Recognition (IEEE Computer Society Press, California, 2000), 877-881.
10. X. Jiang and H. Bunke, in Proc. of IAPR Workshop on Machine Vision and Applications (1996), 538-541.
11. X. Jiang and H. Bunke, in Asian Conference on Computer Vision (1998), 299-306.
12. X. Jiang and H. Bunke, Machine Vision and Applications 7, 2 (1994), 115-122.
13. J. Min et al., in International Conference on Pattern Recognition (IEEE Computer Society Press, 2000), 644-647.
14. K. Chang et al., in Computer Vision and Pattern Recognition (IEEE Computer Society Press, 1999), 1:294-299.
15. M. Shin et al., in Computer Vision and Pattern Recognition (IEEE Computer Society Press, 1998), 190-195.
16. D. H. Ballard and C. M. Brown, Computer Vision (Prentice-Hall, 1982).
17. R. C. Gonzalez and R. E. Woods, Digital Image Processing (Addison-Wesley, 1992).
18. R. M. Haralick and L. G. Shapiro, Computer and Robot Vision (Addison-Wesley, 1992).
19. M. D. Levine, Vision In Man and Machine (McGraw-Hill, 1985).
20. M. Bland, An Introduction to Medical Statistics (Oxford University Press, 1995).
21. L. Cinque et al., in International Conference on Pattern Recognition (IEEE Computer Society Press, 2000), 474-477.
CHAPTER 2
Training/Test Data Partitioning for Empirical Performance Evaluation
Xiaoyi Jiang, Christophe Irniger, Horst Bunke
Department of Computer Science, University of Bern, Neubrückstrasse 10, CH-3012 Bern, Switzerland
E-mail: {jiang,irniger,bunke}@iam.unibe.ch
The issue of training/test set design has not gained much attention in works on empirical performance evaluation. Typically, the division of a set of images into training and test set is done manually. But manual partitioning of image sets is always tedious for the human operator and tends to be biased. In this paper it is argued that a systematic, optimization-based approach is advantageous. We formally state the training/test set design task as a set partition problem in an optimization context. Because of the NP-hardness of this discrete optimization problem, genetic algorithms are proposed to find suboptimal solutions in reasonable time. We use a range image segmentation comparison task as a testbed to do a validation of our approach. The training/test image set design technique proposed in this paper is very general and can easily be applied to other problem domains. Interestingly, it gives us the possibility of generating both "best-case" and "worst-case" test scenarios, thus providing a rich set of tools for evaluating algorithms from various viewpoints.
2.1. Introduction
Traditionally, computer vision suffers from a lack of sound experimental work. As pointed out by R. Jain and T. Binford15: "The importance of theory cannot be overemphasized. But at the same time, a discipline without experimentation is not scientific. Without adequate experimental methods, there is no way to rigorously substantiate new ideas and to evaluate different approaches." The dialogue on "Ignorance, Myopia, and Naivete in Computer Vision Systems" initiated by R. Jain and T. Binford15 and the responses documented the necessity of evaluating theoretical findings,
vision algorithms, etc., by using empirical data. Today sound empirical comparison of vision algorithms is increasingly being accepted as original research, including research aspects such as methods, tools, and databases for performance evaluation, design of experiments, and actual comparison of different algorithms for a vision problem. The reader is referred to refs.2,4,21 for collections and to ref.3 for an overview of recent works on empirical evaluation techniques in computer vision. The present work assumes that ground truth can be determined accurately and used as a comparison basis for the empirical evaluation.

To evaluate different algorithms for a particular vision task, a comparative framework consisting of at least the following three elements has to be developed:

• Imagery design: acquisition of image data and training/test set partitioning.
• Ground truth: this is mostly generated manually and serves as the basis for the comparison.
• Performance metrics: they measure the goodness of algorithm results and play a fundamental role in empirical performance evaluation.

One concern in imagery design is the possible "dimensions" of the task under consideration, i.e., parameters that (in the ideal case fully) cover the variations of the scenes to be imaged. Theoretically, testing algorithms on an image set that spans the ranges of all these dimensions would yield "failure points" and "tolerances". In practice, however, acquiring, ground-truthing, processing, and analyzing the necessary image data would require a prohibitive amount of effort. As a consequence, an image set of manageable size is usually acquired that reasonably explores the problem dimensions.

A second aspect of imagery design is training/test set partitioning. A given image set is divided into two disjoint subsets for training and test purposes, respectively. The training set is used to fix the parameters of algorithms, while performance metrics are computed on images from the test set using the fixed parameters. The training/test set partitioning has not gained much attention so far and is typically done randomly. In this paper we argue that a systematic, optimization-based approach is advantageous (Section 2.2). The resulting discrete optimization task turns out to be an NP-hard problem. Thus, we have to resort to approximate approaches that find suboptimal solutions in reasonable time. In this work we develop genetic
algorithms (GA) for this purpose (Section 2.3). We use the range image segmentation comparison project13,23 as a testbed to do a validation of our approach (Sections 2.4 and 2.5). The training/test image set design technique proposed in this paper is very general and can easily be applied to other problem domains.
2.2. Formal Problem Definition

The characteristics of an image set $\Pi$ can be expressed in terms of image features. For the range image segmentation task13,23, for instance, some examples of image features are:

• size, contour length, compactness, etc. of regions;
• angles between adjacent regions.

Each image feature $f_i$ is associated with a distribution (probability) of its values, $P_i$, over $\Pi$. Any partition of $\Pi$ into two disjoint subsets $\Pi_1$ and $\Pi_2$, $\Pi_1 \cup \Pi_2 = \Pi$, implies distributions $P_{i1}$ and $P_{i2}$ over $\Pi_1$ and $\Pi_2$, respectively. Assume that $\Pi_1$ ($\Pi_2$) serves as the training (test) image set. Then we are facing the question of what properties the distributions $P_{i1}$ and $P_{i2}$ should have in order to make the corresponding partition into training/test set most meaningful. In the following we assume that the cardinality of $\Pi_1$ is specified by the user.

The usual practice of using the training images to fix the parameters of algorithms implicitly assumes that the test images have characteristics similar to those of the training images. Expressed in terms of image features this means that the training set $\Pi_1$ and the test set $\Pi_2$ should have similar distributions $P_{i1}$ and $P_{i2}$ for all features $f_i$. But in case of a random partitioning of $\Pi$ into $\Pi_1$ and $\Pi_2$, as done in refs.13,23 for instance, there is a danger of large differences between the distributions $P_{i1}$ and $P_{i2}$ of these features, which is clearly in contradiction with the assumption usually made in empirical performance evaluation studies.

Assume a function $d_i(P_{i1}, P_{i2})$ that defines the distance, or dissimilarity, between two distributions $P_{i1}$ and $P_{i2}$. Then, the discussion above suggests a strategy of finding favorable partitions by minimizing the term:
$$\sum_{i=1}^{n} w_i \cdot d_i(P_{i1}, P_{i2}) \qquad (2.1)$$
Here $n$ is the number of image features under consideration and $w_i$ represents a weighting factor that controls the relative influence of the various features. This understanding of favorable partitions gives us a systematic method for determining training/test image sets. Interestingly, it opens the door to other evaluation scenarios that may be of interest as well. Let us consider a partition that maximizes the term (2.1). This would result in an extreme test case in which the test image set differs to the largest extent (with respect to the measure (2.1)) from the training image set. If we set all weighting factors $w_i$ to zero except one, $w_k$, then we are potentially able to test the sensitivity of the algorithms with respect to the particular feature $f_k$. This kind of "worst-case" test scenario and the usual "best-case" test scenarios together provide a rich set of different scenarios for evaluating algorithms from various viewpoints. Considering both cases we formulate the strategy of finding favorable partitions formally as:

Partition problem $\mathcal{P}$: For a given image set $\Pi$, a number $1 < c < |\Pi|$, distance functions $d_i()$, and weighting factors $w_i$, find a partition of $\Pi$ into two disjoint subsets $\Pi_1$ and $\Pi_2$, $|\Pi_1| = c$, $\Pi_1 \cup \Pi_2 = \Pi$, such that
$$M = \sum_{i=1}^{n} w_i \cdot d_i(P_{i1}, P_{i2})$$
is optimized (minimized or maximized).

In this definition a number of specifications are made by the user. The image features and their relative weighting factors have to be chosen. Also the distance function $d_i()$ between two distributions is to be defined.
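Evaluating the objective $M$ for a candidate partition is straightforward once the per-feature histograms are in hand. A minimal sketch, with all names hypothetical:

```python
def partition_cost(hists1, hists2, weights, dist):
    """Objective M of the partition problem: sum_i w_i * d_i(P_i1, P_i2).

    hists1, hists2 -- per-feature normalized histograms over subsets Pi_1 and Pi_2
    weights        -- relative feature weights w_i
    dist           -- distance function between two histograms
    """
    return sum(w * dist(p1, p2) for w, p1, p2 in zip(weights, hists1, hists2))
```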
=
9[E[f(^)}}
where / is a continuous convex real function, E corresponds to the expectation with respect to Pi, and g represents an increasing real function. The
Training/Test Data Partitioning
27
divergence function d(Pi, P 2 ) has the important property that it is minimal when Pi = P 2 . Some examples of this class of distance functions are: • variational distance: f{x) = |1 - x\, g(x) = \x, d ( P i , f t ) = \ J\Pi
- P2\dx
• Hellinger distance: f{x) = (1 - v/i) 2 , gix) = \x, d(Pi, Pi) = \ Uyfp\ f
-
y/Pi)2dx
Pi
fix) = - logo;, gix) = x, d(Pi, P 2 ) = / Pi log — dx Kullback divergence: fix) = (x-1) log a;, 5(x) = x, d(Pi, P 2 ) = / Pi log ydx+
/ P 2 log - ^ d x
Other examples are Chernoff distance, Bhattacharyya distance, and Mahalanobis distance, all being popular in signal processing and pattern recognition. In practice a probability distribution becomes a normalized histogram over a discrete set of values, where the normalization means that the sum of the histogram entries amounts to one. Accordingly, the integrals in the distance functions above have to be replaced by summations. In some cases the discrete nature requires further modifications. The variational distance, for example, leads to the Li-metric, which, as pointed out in ref.25, is not an appropriate distance function for comparing histograms. Instead, cumulative histograms should be used. Histograms are usually not completely dense vectors, i.e., some of the entries are zero. This fact results in some difficulties in applying the class of /-divergence functions in general, examples are the Kullback information and divergence function. One possible solution is to add one to each histogram entry before normalization.
2.2.2. Computational Complexity
The partition problem P belongs to the class of NP-hard discrete optimization problems. To prove the NP-hardness of P we use the following result:

Lemma: Given a finite set A and a function s(a) ∈ Z+ (positive integers) for each a ∈ A, the problem of deciding whether there is a subset A1 ⊂ A such that

$$\sum_{a \in A_1} s(a) = \sum_{a \in A_2 (= A - A_1)} s(a)$$
holds is NP-complete. See ref. 9 for a proof. Based on this lemma we can prove:

Corollary: The partition problem P is NP-hard.

Proof: We transform the partition problem P* given in the lemma to P. First we assume that the cardinality of the solution A1, if existent, is known. We consider each element a ∈ A as an "image" that has a single feature f1 taking the value s(a). For a partition of A into disjoint subsets A1 and A2, A1 ∪ A2 = A, the distance d(P1, P2) of the two distributions P1 and P2 for f1 is defined to be:
$$d(P_1, P_2) = \left| \sum_{a \in A_1} s(a) - \sum_{a \in A_2} s(a) \right|$$
Now we have an instance of partition problem P (minimization version) and its solution immediately tells us if the partition problem P*, for a particular cardinality |A1|, is solvable. Since |A1| is unknown, we actually have to consider a series of values 1, 2, ..., |A|/2 for |A1| and generate a total of |A|/2 instances of P. Their solutions finally give the answer to the partition problem P* given in the lemma. This shows that P* is reducible to P in polynomial time. Thus, the partition problem P is NP-hard. QED

The NP-hardness of the partition problem P has the consequence that we are forced to develop approximate approaches to finding suboptimal solutions in reasonable time.

2.3. Genetic Search Algorithm

The task under consideration here belongs to a wide class of discrete optimization problems which are of pivotal importance in pattern recognition and computer vision. Although the theory of continuous optimization is mature, discrete or configurational optimization is still in its infancy. Recent techniques for solving such optimization problems include simulated annealing 18, mean-field annealing 27, tabu search 10, and various approaches to evolutionary computation 8, including genetic search 11, 19.
Genetic search techniques are motivated by concepts from the theory of biological evolution. They are general-purpose optimization methods that have been successfully applied to difficult tasks in search, optimization, machine learning, etc. In particular it has been shown that genetic algorithms are useful for finding approximate solutions to general NP-complete problems 5, 7. In pattern recognition and computer vision, too, genetic search has found numerous applications; some examples are refs. 16, 17, 20, 22, 24.

Basic to genetic search is the idea of maintaining a population of chromosomes representing possible solutions to the discrete optimization problem at hand. A cost function, termed the fitness, is associated with the solution candidates. Given an initial population, genetic algorithms use genetic operators to alter chromosomes in the population and create a new generation. The genetic operator crossover involves selecting pairs of solution candidates randomly from the current population and interchanging the solutions at selected configuration sites. Mutation aims at introducing new information into the population by randomly altering the component symbols of individual solution candidates. The process of population generation is repeated until a stop condition is satisfied. The reader is referred to refs. 11, 19 for more details on genetic algorithms in general.

As argued in ref. 6, genetic algorithms have a number of advantages over other discrete optimization methods. One of the unique features of genetic search is the crossover operator: it effectively provides a means of combining locally consistent subsolutions to generate a globally consistent solution. In addition, although discrete gradient-ascent optimization with multiple random starts shares the idea of maintaining a population of alternative solutions, it is the genetic operators that ensure a higher likelihood of global convergence. For these reasons we resort to a genetic search strategy in this work.

In our task the representation of a potential solution as a chromosome is straightforward: it is simply a string of binary bits of length equal to the size of the given image set Π. Each bit corresponds to an image in Π and takes value one or zero if the corresponding image is in the partition subset Π1 or Π2, respectively. Since the cardinality of Π1 is specified by the user, a chromosome has |Π1| one-bits and |Π| - |Π1| zero-bits. The fitness function is given by the optimization term of the partition problem P. The crossover and mutation operators are the standard ones for strings.
In our particular application, however, it can happen that a resultant chromosome is no longer a valid representation of a solution candidate, i.e. the number of one-bits is not identical to |Π1|. This problem can easily be detected and solved by randomly changing some bits until the chromosome becomes valid again.

There are a number of strategies known for improving the ordinary genetic algorithm. One example is a hybrid approach combining the genetic algorithm with local hill climbing that has been applied, for instance, in recent works on GA-based graph matching 6, 26. The idea is that once an offspring chromosome is created by the genetic operators, a hill climbing step is performed to search for the optimal chromosome in a local neighborhood. By doing this the genetic algorithm only deals with locally optimal chromosomes, thus increasing the chance of finding a global optimum. In our case the local neighborhood of a chromosome may be constructed in the following way. We flip each one-bit in turn; to keep the chromosome valid we have to flip a zero-bit at the same time. We could build pairs of a one-bit with all possible zero-bits; for efficiency reasons, however, we randomly select only one zero-bit for each one-bit. In total an offspring chromosome therefore has |Π1| neighbors, which are evaluated by the fitness function and compared with the offspring chromosome. Only the best takes a place in the population. This hybrid approach is potentially more powerful in finding the global optimum, at the price of a higher computational demand.
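The repair step and the randomized one-bit/zero-bit swap neighborhood can be sketched as follows (Python; the chapter's own implementation uses the C++ GAlib library, so these function names are ours):

```python
import random

def repair(bits, ones_required, rng=random):
    """Randomly flip bits until the chromosome has exactly |Pi1| one-bits."""
    bits = bits[:]
    ones = [i for i, b in enumerate(bits) if b == 1]
    zeros = [i for i, b in enumerate(bits) if b == 0]
    while len(ones) > ones_required:
        i = ones.pop(rng.randrange(len(ones)))
        bits[i] = 0
        zeros.append(i)
    while len(ones) < ones_required:
        i = zeros.pop(rng.randrange(len(zeros)))
        bits[i] = 1
        ones.append(i)
    return bits

def hill_climb(offspring, fitness, minimize=True, rng=random):
    """One local step: pair each one-bit with a randomly chosen zero-bit and
    swap them, giving |Pi1| neighbors; keep the best of offspring and
    neighbors."""
    best, best_f = offspring, fitness(offspring)
    zeros = [i for i, b in enumerate(offspring) if b == 0]
    for i, b in enumerate(offspring):
        if b == 1 and zeros:
            j = rng.choice(zeros)
            neigh = offspring[:]
            neigh[i], neigh[j] = 0, 1
            f = fitness(neigh)
            better = f < best_f if minimize else f > best_f
            if better:
                best, best_f = neigh, f
    return best, best_f
```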
2.4. A Testbed

We use the range image segmentation comparison project 13, 23 as a testbed for our training/test set design approach. This study covers the full spectrum of an empirical performance evaluation framework, including large image databases, performance metrics, and public availability of all data, software and results of evaluated algorithms. In consideration of the large number of factors in the test imagery design, three image sets were acquired using different sensors. Table 2.1 summarizes their characteristics. The ground truth segmentation was generated by human operators. Figure 2.1 gives an example from the ABW image set, showing the intensity image, the range image, and the ground truth segmentation (from left to right). In refs. 13, 23 the ABW/Perceptron/K2T image set was divided into a training set of 10/10/20 images and a test set of 20/20/40 images. The partition was done manually, basically by human inspection of the images.

Table 2.1. Characteristics of the three image sets.

                  ABW                Perceptron        K2T
sensor type       structured light   time-of-flight    structured light
# images          40                 40                60
resolution        512 x 512          512 x 512         480 x 640
surfaces          planar             planar            planar/curved
imaging volume    table-top size     room-size         table-top size

Fig. 2.1. An example image and ground truth from ABW set.

For testing our training/test set design approach we define the following features:
• region type: planar, cone, cylinder, sphere, torus
• region size: number of pixels
• region perimeter
• region compactness: region shape defined by perimeter/size
• angle between two adjacent regions
• length of contours between two adjacent regions
For the ABW image shown in Figure 2.1, for instance, the segmentation ground truth contributes the following angle values to the overall angle distribution: 34.6°(1), 54.0°(1), 58.4°(1), 61.3°(1), 66.6°(1), 68.2°(2), 75.1°(2), 84.5°(1), 90.0°(10), where the number in brackets indicates the number of occurrences of the angle. If we only consider the angle feature, i.e. set all weighting factors w_i except that for angle to zero, then the manual partition used in ref. 13 leads to M = 0.016472 using the variational distance function. This is illustrated in Figure 2.2, which shows the angle distributions of both the training and the test set.
Fig. 2.2. Angle distribution of training/test set: ABW, manual partition.
2.5. Experimental Results

The proposed genetic algorithm has been implemented in C++ using the genetic algorithm library GAlib (available at http://lancet.mit.edu/ga). The validation of our training/test set design approach in the testbed described in the last section is done by a series of experiments. In this section we report some of the experimental results; the full details can be found in ref. 14. There are a variety of parameters to be set for genetic algorithms, the most important being population size, crossover probability and mutation probability. Several studies 12, 19 have tried to find generic optimal parameter settings for a broad class of problems. Mostly, a vast number of possible parameter settings are evaluated to obtain good ones. An interesting alternative was reported by Grefenstette 12, who optimizes the control parameters of genetic algorithms using another meta-level genetic algorithm and finally suggests three parameter settings. We have experimentally evaluated the three settings and decided to use 50 (population size), 0.6 (crossover probability), and 0.001 (mutation probability) throughout all our experiments. Using the angle feature alone, the hybrid genetic algorithm is used to generate a best and a worst partition by minimizing and maximizing, respectively, the optimization term M.
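The generational loop with these settings might look as follows. This plain-Python stand-in (our names; elitism and truncation selection are simplifying assumptions) only illustrates the structure, since the actual experiments used GAlib:

```python
import random

POP_SIZE, P_CROSS, P_MUT = 50, 0.6, 0.001   # the settings adopted above

def evolve(length, ones_required, fitness, generations=200,
           minimize=True, seed=0):
    rng = random.Random(seed)

    def repair(bits):
        # Restore exactly |Pi1| one-bits after crossover/mutation.
        ones = [i for i, b in enumerate(bits) if b]
        zeros = [i for i, b in enumerate(bits) if not b]
        while len(ones) > ones_required:
            bits[ones.pop(rng.randrange(len(ones)))] = 0
        while len(ones) < ones_required:
            i = zeros.pop(rng.randrange(len(zeros)))
            bits[i] = 1
            ones.append(i)
        return bits

    def rand_chrom():
        bits = [1] * ones_required + [0] * (length - ones_required)
        rng.shuffle(bits)
        return bits

    key = fitness if minimize else (lambda b: -fitness(b))
    pop = [rand_chrom() for _ in range(POP_SIZE)]
    for _ in range(generations):
        pop.sort(key=key)
        nxt = pop[:2]                                  # keep the two best
        while len(nxt) < POP_SIZE:
            a, b = rng.sample(pop[:POP_SIZE // 2], 2)  # truncation selection
            child = a[:]
            if rng.random() < P_CROSS:                 # one-point crossover
                cut = rng.randrange(1, length)
                child = a[:cut] + b[cut:]
            child = [bit ^ (rng.random() < P_MUT) for bit in child]
            nxt.append(repair(child))
        pop = nxt
    return min(pop, key=key)
```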
Table 2.2. Optimization term M: hybrid GA vs. manual partition.

              Best partition   Worst partition   Manual partition
ABW           0.001448         0.116354          0.016472
Perceptron    0.002409         0.199991          0.033738
K2T           0.004296         0.630148          0.027788
Fig. 2.3. Optimization term M: hybrid GA vs. manual partition.
The computed partitions result in the M values recorded in Table 2.2, shown graphically in Figure 2.3. For the ABW image set, the angle distributions of both partitions are illustrated in Figure 2.4; see Figure 2.2 for a comparison with the manual partition. Note that because of the stochastic nature of genetic search we ran the program several times and recorded the best results. For comparison purposes the M values of the manual partition are given in Table 2.2 as well. In all cases the best partition found by the hybrid genetic algorithm is better than the corresponding manual partition. The worst partition provides us, on the one hand, with an extreme test situation in which the training and test images have quite different characteristics with respect to the angle feature. On the other hand, it represents an upper limit on how bad a partition a human operator may produce in his efforts to create a good one. The ordinary genetic algorithm is faster than the hybrid one; its results are slightly worse than, but comparable to, those of the hybrid GA. Thus, in all cases both genetic algorithms are able to improve the manual partition.
Fig. 2.4. Angle distribution of training/test set: ABW, hybrid GA. Best (top) and worst partition (bottom).

Besides the best results recorded in Table 2.2, it is also of interest how much the genetic search results vary depending on the random seed. Table 2.3 shows the range of results obtained from all runs of the program. Like any other stochastic optimization technique, genetic algorithms cannot guarantee to provide the true optimum. Therefore, it is interesting to investigate the quality of the computed partitions compared to the truly optimal partition.
Table 2.3. Range of results of all runs.

              Best partition           Worst partition
ABW           [0.001448, 0.002172]     [0.107529, 0.116354]
Perceptron    [0.002409, 0.005666]     [0.172333, 0.199991]
K2T           [0.004296, 0.006314]     [0.630148, 0.630148]
Since the partition problem P is NP-hard, we have no means to compute the true optimum in reasonable time for image sets as large as our three image sets. In order to get some evidence of solution quality we have conducted a comparison on a reduced set of images. Concretely, we randomly selected thirty out of the forty ABW images, which were then divided into a set of 10 images and a second one of 20 images. A combinatorial search without any pruning was done, and the genetic algorithms were run to produce approximate solutions. In this test we used all the features listed in Section 2.4 except the region type, which is irrelevant here. The weighting factors w_i were all set to one. The achieved results are:

true optimum: 0.005366    ordinary GA: 0.006622    hybrid GA: 0.005775
The approximate solutions found by the genetic algorithms are not optimal, but near the true optimum. The combinatorial search was carried out on eight SUN Ultra5 workstations in parallel and took about 17 days. In contrast, the ordinary and hybrid genetic algorithms needed only 15 minutes and 90 minutes, respectively, on a single SUN Ultra5 workstation. These and other results documented in ref. 14 indicate that the proposed genetic algorithms are able to produce near-optimal partitions of image sets.

2.6. Conclusions

The problem of training/test set design has not gained much attention so far. Partitioning a database of images into a training and a test set is typically done manually. In this paper we have argued that a systematic, optimization-based approach is advantageous. Manual partitioning of image sets is tedious for the human operator and tends to be biased. A systematic approach, on the other hand, formulates the partitioning criteria implicitly used by the human operator in an explicit way.
By doing this we obtain a clear understanding of what we want to achieve. An optimization algorithm then makes sure that the computed partitions are (nearly) optimal. Interestingly, this approach gives us the possibility of generating both "best-case" and "worst-case" test scenarios, thus providing a rich set of tools for evaluating algorithms from various viewpoints. We have formally formulated the training/test set design task as a set partition problem in an optimization context, which turns out to be NP-hard. Genetic algorithms have been proposed to solve the optimization problem. Embedded in a range image segmentation comparison, we have demonstrated the ability of the genetic algorithms to obtain suboptimal solutions. The training/test image set design technique proposed in this paper is very general and can easily be applied to other problem domains.

Acknowledgments

The authors would like to thank Matthew Wall at the Massachusetts Institute of Technology for making the GAlib genetic algorithm package publicly available.
References

1. M. Basseville, Distance measures for signal processing and pattern recognition, Signal Processing, 18(4): 349-369, 1989.
2. K.W. Bowyer and P.J. Phillips (eds), Empirical Evaluation Techniques in Computer Vision, IEEE Computer Society Press, 1998.
3. K.W. Bowyer and P.J. Phillips, Overview of work in empirical evaluation of computer vision algorithms, in ref. 2, 1-11, 1998.
4. H. Christensen and W. Forstner (eds), Special issue on performance evaluation, Machine Vision and Applications, 9(5/6): 215-340, 1997.
5. A.L. Corcoran and R.L. Wainwright, Using LibGA to develop genetic algorithms for solving combinatorial optimization problems, in L. Chambers (ed), Practical Handbook of Genetic Algorithms: Applications, Volume I, 143-172, CRC Press, 1995.
6. A.D.J. Cross, R.C. Wilson and E.R. Hancock, Inexact graph matching using genetic search, Pattern Recognition, 30(6): 953-970, 1997.
7. K.A. DeJong and W.M. Spears, Using genetic algorithms to solve NP-complete problems, Proc. of 3rd Int. Conf. on Genetic Algorithms, Fairfax, VA, 124-132, 1989.
8. D.B. Fogel and L.J. Fogel (eds), Special Issue on Evolutionary Computation, IEEE Trans. on Neural Networks, 5(1), 1994.
9. M.R. Garey and D.S. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness, W.H. Freeman and Company, 1979.
10. F. Glover and M. Laguna, Tabu Search, Kluwer Academic Publishers, 1997.
11. D.E. Goldberg, Genetic Algorithms in Search, Optimization and Machine Learning, Addison-Wesley, 1989.
12. J.J. Grefenstette, Optimization of control parameters for genetic algorithms, IEEE Trans. on SMC, 16(1): 122-128, 1986.
13. A. Hoover, G. Jean-Baptiste, X. Jiang, P.J. Flynn, H. Bunke, D. Goldgof, K. Bowyer, D. Eggert, A. Fitzgibbon and R. Fisher, An experimental comparison of range image segmentation algorithms, IEEE Transactions on PAMI, 18(7): 673-689, 1996.
14. C. Irniger, Design of Image Data Sets for Segmentation Performance Evaluation, Master's Thesis, University of Bern, 2000.
15. R. Jain and T. Binford, Ignorance, myopia, and naivete in computer vision systems, CVGIP: Image Understanding, 53(1): 112-117, 1991.
16. A. Jain and D. Zongker, Feature selection: evaluation, application, and small sample performance, IEEE Trans. on PAMI, 19(2): 153-158, 1997.
17. A.J. Katz and P.R. Thrift, Generating image filters for target recognition by genetic learning, IEEE Trans. on PAMI, 16(9): 906-910, 1994.
18. S. Kirkpatrick, C.D. Gelatt and M.P. Vecchi, Optimization by simulated annealing, Science, 220: 671-680, 1983.
19. M. Mitchell, An Introduction to Genetic Algorithms, MIT Press, 1996.
20. M. Parizeau, N. Ghazzali and J.F. Hebert, Optimizing the cost matrix for approximate string matching using genetic algorithms, Pattern Recognition, 31(4): 431-440, 1998.
21. P.J. Phillips and K.W. Bowyer (eds), Special section on empirical evaluation of computer vision algorithms, IEEE Transactions on PAMI, 21(4): 289-335, 1999.
22. J. Pittman and C.A. Murthy, Fitting optimal piecewise linear functions using genetic algorithms, IEEE Trans. on PAMI, 22(7): 701-718, 2000.
23. M.W. Powell, K.W. Bowyer, X. Jiang and H. Bunke, Comparing curved-surface range image segmenters, Proc. of the 6th Int. Conf. on Computer Vision, Bombay, India, 286-291, 1998.
24. G. Roth and M.D. Levine, Geometric primitive extraction using a genetic algorithm, IEEE Trans. on PAMI, 16(9): 901-905, 1994.
25. M. Stricker and M. Orengo, Similarity of color images, SPIE Vol. 2420, Storage and Retrieval for Image and Video Databases III, 381-392, 1995.
26. Y.-K. Wang, K.-C. Fan and J.-T. Horng, Genetic-based search for error-correcting graph isomorphism, IEEE Trans. on SMC - Part B: Cybernetics, 27(4): 588-597, 1997.
27. A. Yuille, Generalised deformable models, statistical physics and matching problems, Neural Computation, 2: 1-24, 1990.
CHAPTER 3

Analyzing PCA-based Face Recognition Algorithms: Eigenvector Selection and Distance Measures
Wendy S. Yambor, Bruce A. Draper, J. Ross Beveridge
Computer Science Department, Colorado State University
Fort Collins, CO, U.S.A. 80523
E-mail: draper/[email protected]
This study examines the role of Eigenvector selection and Eigenspace distance measures on PCA-based face recognition systems. In particular, it builds on earlier results from the FERET face recognition evaluation studies, which created a large face database (1,196 subjects) and a baseline face recognition system for comparative evaluations. This study looks at using combinations of traditional distance measures (City-block, Euclidean, Angle, Mahalanobis) in Eigenspace to improve performance in the matching stage of face recognition. A statistically significant improvement is observed for the Mahalanobis distance alone when compared to the other three alone. However, no combination of these measures appears to perform better than Mahalanobis alone. This study also examines questions of how many Eigenvectors to select and according to what ordering criterion. It compares variations in performance due to different distance measures and numbers of Eigenvectors. Ordering Eigenvectors according to a like-image difference value rather than their Eigenvalues is also considered.
3.1. Introduction

Over the past few years, several face recognition systems have been proposed based on principal components analysis (PCA) 14, 8, 13, 15, 1, 10, 16, 6. Although the details vary, these systems can all be described in terms of the same preprocessing and run-time steps. During preprocessing, they register a gallery of m training images to each other and unroll each image into a vector of n pixel values. Next, the mean image for the gallery is subtracted from each and the resulting "centered" images are placed in a gallery matrix M. Element [i, j] of M is the ith pixel from the jth image. A covariance matrix Ω = M M^T characterizes the distribution of the m images in R^n. A subset of the Eigenvectors of Ω are used as the basis vectors for a subspace in which to compare gallery and novel probe images. When sorted by decreasing Eigenvalue, the full set of unit length Eigenvectors represents an orthonormal basis where the first direction corresponds to the direction of maximum variance in the images, the second to the next largest variance, etc. These basis vectors are the Principal Components of the gallery images. Once the Eigenspace is computed, the centered gallery images are projected into this subspace. At run-time, recognition is accomplished by projecting a centered probe image into the subspace; the nearest gallery image to the probe image is selected as its match.

There are many differences in the systems referenced. Some systems assume that the images are registered prior to face recognition 15, 10, 11, 16; among the rest, a variety of techniques are used to identify facial features and register them to each other. Different systems may use different distance measures when matching probe images to the nearest gallery image. Different systems select different numbers of Eigenvectors (usually those corresponding to the largest k Eigenvalues) in order to compress the data and to improve accuracy by eliminating Eigenvectors corresponding to noise rather than meaningful variation. To help evaluate and compare individual steps of the face recognition process, Moon and Phillips created the FERET face database, and performed initial comparisons of some common distance measures for otherwise identical systems 10, 11, 9. This paper extends their work, presenting further comparisons of distance measures over the FERET database and examining alternative ways of selecting subsets of Eigenvectors.
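The preprocessing and matching steps just described can be sketched in a few lines of NumPy. The function names are ours, and the small-matrix eigendecomposition of M^T M is a standard implementation device rather than a step the referenced systems are claimed to share:

```python
import numpy as np

def build_eigenspace(gallery, k):
    """gallery: n_pixels x m matrix, one flattened image per column."""
    mean = gallery.mean(axis=1, keepdims=True)
    M = gallery - mean                          # centered gallery images
    # Eigenvectors of the small m x m matrix M^T M give those of M M^T
    # (the usual trick when n_pixels >> m); eigenvalues coincide.
    evals, V = np.linalg.eigh(M.T @ M)
    order = np.argsort(evals)[::-1][:k]         # descending Eigenvalue
    basis = M @ V[:, order]
    basis /= np.linalg.norm(basis, axis=0)      # unit-length Eigenvectors
    return mean, basis, evals[order]

def project(images, mean, basis):
    """Project centered images into the Eigenspace."""
    return basis.T @ (images - mean)

def nearest_gallery(probe_coord, gallery_coords):
    """Match a projected probe to the closest projected gallery image (L2)."""
    d = ((gallery_coords - probe_coord[:, None]) ** 2).sum(axis=0)
    return int(np.argmin(d))
```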
3.2. The FERET Database

For readers who are not familiar with it, the FERET database contains images of 1,196 individuals, with up to 5 different images captured for each individual. The images are separated into two sets: gallery images and probe images. Gallery images are images with known labels, while probe images are matched to gallery images for identification. The database is broken into four categories:

FB: Two images were taken of an individual, one after the other. In one image, the individual has a neutral facial expression, while in the other they have a non-neutral expression. One of the images is placed into the gallery file while the other is used as a probe. In this category, the gallery contains 1,196 images and the probe set has 1,195 images.

Duplicate I: The only restriction of this category is that the gallery and probe images are different. The images could have been taken on the same day or a year apart. In this category, the gallery consists of the same 1,196 images as the FB gallery while the probe set contains 722 images.

fc: Images in the probe set are taken with a different camera and under different lighting than the images in the gallery set. The gallery contains the same 1,196 images as the FB and Duplicate I galleries, while the probe set contains 194 images.

Duplicate II: Images in the probe set were taken at least 1 year after the images in the gallery. The gallery contains 864 images, while the probe set has 234 images.

This study uses FB, Duplicate I and Duplicate II images.

3.3. Distance Measures

In ref. 9, Moon and Phillips look at the effect of four traditional distance measures in the context of face recognition: city-block (L1 norm), squared Euclidean distance (L2 norm), angle, and Mahalanobis distance. The appendix in ref. 9 formally defines these distance measures, and for completeness these definitions are repeated here in Appendix A. There was one minor problem encountered with the definition of Mahalanobis distance given in ref. 9, and this is discussed in Appendix A. This paper presents further evaluations of traditional distance measures in the context of face recognition.
Table 3.1. Percent of probe images correctly recognized for combined classifiers: a) base distance measures and summed combinations on Duplicate I images, b) classifiers using bagging on Duplicate I and FB images.

(a)
Classifier                        Dup. I
L1                                35
L2                                33
Angle                             34
Mahalanobis                       42
S(L1, L2)                         35
S(L1, Angle)                      39
S(L1, Mahalanobis)                43
S(L2, Angle)                      33
S(L2, Mahalanobis)                42
S(Angle, Mahalanobis)             42
S(L1, L2, Angle)                  35
S(L1, L2, Mahalanobis)            42
S(L1, Angle, Mahalanobis)         43
S(L2, Angle, Mahalanobis)         42
S(L1, L2, Angle, Mahalanobis)     42

(b)
Classifier           Dup. I   FB
L1                   35       77
L2                   33       72
Angle                34       70
Mahalanobis          42       74
Bagging              37       75
Bagging, Best 5      38       78
Bagging, Weighted    38       77
In particular, we considered the hypothesis that some combination of the four standard distance measures (L1, L2, angle and Mahalanobis) might outperform the individual distance measures. To this end, we test both simple combinations of the distance measures, and "bagging" the results of two or more measures using a voting scheme 2, 4, 7.

3.3.1. Adding Distance Measures

The simplest mechanism for combining distance measures is to add them. In other words, the distance between two images is defined as the sum S of the distances according to two or more traditional measures:
$$S(a_1, \ldots, a_h) = a_1 + \cdots + a_h \qquad (3.1)$$
Using S, all combinations of the base metrics (L1, L2, angle, Mahalanobis) were used to select the nearest gallery image to each probe image. The percentage of images correctly recognized using each combination is shown in Table 3.1a, along with the recognition rates for the base measures themselves. Of the four base distance measures, there appears to be a significant improvement with Mahalanobis distance: on the surface, 42% seems much better than 33%, 34% or 35%. The best performance for any combined measure was 43%, for the S(L1, Angle, Mahalanobis) combination. While higher, this does not appear significant. The question of when a difference is significant will be taken up more formally in Section 3.3.4. Interestingly, the performance of a combined measure was never less than the performance of its components evaluated separately. For example, the performance of S(L1, L2) is 35%; this is better than the performance of L2 (33%) and the same as L1 (35%).
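A sketch of the summed combination used to produce Table 3.1a (Python; the lambda definitions are simplified stand-ins for the measures of Appendix A):

```python
import numpy as np

l1 = lambda x, y: np.abs(x - y).sum()
l2 = lambda x, y: ((x - y) ** 2).sum()
angle = lambda x, y: -(x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))

def combined_nearest(probe, gallery, measures):
    """Nearest gallery image under S = sum of the given distance measures.
    probe: k-vector; gallery: k x m matrix of projected gallery images."""
    total = np.zeros(gallery.shape[1])
    for d in measures:
        total += np.array([d(probe, gallery[:, j])
                           for j in range(gallery.shape[1])])
    return int(np.argmin(total))
```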
3.3.2. Distance Measure Aggregation
The experiment above tested only a simple summation of distance measures; one can imagine many weighting schemes for combining distance measures that might outperform simple summation. Rather than search the space of possible distance measure combinations, however, we took a cue from recent work in machine learning that suggests the best way to combine multiple estimators is to apply each estimator independently and combine the results by voting 2, 4, 7. For face recognition, this implies that each distance measure is allowed to vote for the image that it believes is the closest match to a probe. The image with the most votes is chosen as the matching gallery image. Voting was performed three different ways (a sketch follows this list):

Bagging: Each classifier is given one vote as explained above.

Bagging, Best of 5: Each classifier votes for the five gallery images that most closely match the probe image.

Bagging, Weighted: Classifiers cast five votes for the closest gallery image, four votes for the second closest gallery image, and so on, casting just one vote for the fifth closest image.
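All three schemes reduce to counting weighted votes over per-measure rankings, as in this sketch (our names; not the authors' code):

```python
import numpy as np
from collections import Counter

def vote(probe, gallery, measures, top=1, weighted=False):
    """Each measure votes for its 'top' closest gallery images; with
    weighted=True the closest gets 'top' votes, the next top-1, and so on."""
    votes = Counter()
    for d in measures:
        dists = [d(probe, gallery[:, j]) for j in range(gallery.shape[1])]
        ranked = np.argsort(dists)[:top]
        for r, j in enumerate(ranked):
            votes[int(j)] += (top - r) if weighted else 1
    return votes.most_common(1)[0][0]

# top=1 gives plain bagging; top=5 gives Best of 5;
# top=5, weighted=True gives the weighted scheme.
measures = [lambda x, y: np.abs(x - y).sum(),
            lambda x, y: ((x - y) ** 2).sum()]
```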
Table 3.2. Correlation of classifiers on the Duplicate I probe set.

Alg. 1   Alg. 2   Rank Correlation
L1       L2       0.46
Angle    L1       0.39
Mah.     L1       0.39
Angle    L2       0.62
Mah.     L2       0.50
Mah.     Angle    0.58

Table 3.3. Images that were poorly identified by the classifiers.

Classifiers in Error   1          2          3          4
Images                 46 (26%)   48 (27%)   34 (19%)   51 (28%)
Table 3.1b shows the performance of voting for the Duplicate I and FB probe sets. On the Duplicate I data, Mahalanobis distance alone does better than any of the bagged classifiers: 42% versus 37% and 38%. On the simpler FB probe set, the best performance for a separate classifier is 77% (for L1) and the best performance for the bagged classifiers is 78%: not an apparently significant improvement. In the next section we explore one possible explanation for this lack of improvement when using bagging.

3.3.3. Correlating Distance Metrics
As described in ref. 2, the failure of voting to improve performance suggests that the four distance measures share the same bias. To test this theory, we correlated the distances calculated by the four measures over the Duplicate I probe set. Since each measure is defined over a different range, Spearman Rank Correlation was used 12. For each probe image, the gallery images were ranked by increasing distance to the probe image. This is done for each pair of distance measures; the result is two rank vectors, one for each distance measure. Spearman's Rank Correlation is the correlation coefficient for these two vectors. Table 3.2 presents the average correlation coefficient over all probe images for pairs of distance measures. Ranks based upon L2, Angle and Mahalanobis all correlate very closely to each other, although L1 correlates less well to Angle and Mahalanobis. This suggests that there might be some advantage to combining L1 with Angle or Mahalanobis, but that no combination of L2, Angle or Mahalanobis is very promising. This is consistent with the scores in Table 3.1a, which show that the combinations of L1 & Angle and L1 & Mahalanobis outperform these classifiers individually.
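Spearman's coefficient on two distance vectors for one probe can be computed as below (Python; the sketch omits the tie correction, which a full implementation such as ref. 12 includes):

```python
import numpy as np

def spearman(dist_a, dist_b):
    """Spearman rank correlation between two distance measures' rankings
    of the gallery for a single probe image (no tie correction)."""
    ra = np.argsort(np.argsort(dist_a)).astype(float)   # ranks under measure A
    rb = np.argsort(np.argsort(dist_b)).astype(float)   # ranks under measure B
    n = len(ra)
    d2 = ((ra - rb) ** 2).sum()
    return 1.0 - 6.0 * d2 / (n * (n ** 2 - 1))

# Average over all probes, as in Table 3.2:
# coeff = np.mean([spearman(dA[p], dB[p]) for p in range(n_probes)])
```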
We also constructed a list of images in the FB probe set that were grossly misclassified, in the sense that the matching gallery image was not one of the ten closest images according to one or more of the distance measures. A total of 179 images were poorly identified by at least one distance measure. To illustrate how difficult some of these recognition problems are, Figure 3.1 shows two images of the same individual that were consistently misclassified; seeing these images, it is easy to see why classification algorithms have difficulty with this pair. Table 3.3 shows the number of images that were poorly identified by one, two, three and all four distance measures. This table shows that there is shared bias among the classifiers, in that they seem to make gross mistakes on the same images. On the other hand, the errors do not overlap completely, suggesting that some improvement might still be achieved by some combination of these distance measures.

3.3.4. When Is a Difference Significant
The data in Table 3.1 begs the question of when a difference in performance is significant. Intuitively, a 1% difference seems likely to arise by chance, while a 10% difference does not. However, to move beyond intuition requires that we formulate a precise hypothesis that can be evaluated using standard statistical hypothesis testing techniques. Moreover, even for such an apparently simple comparison as presented in Table 3.1, there are at least two distinct ways to elaborate the question. First, is the difference as seen over the entire set of probe images significant? Second, when the algorithms behave differently, is the difference significant? Let us approach the question of significance over the entire sample first. In keeping with standard practice for statistical hypothesis testing, we must formulate our hypothesis and the associated null hypothesis. For simplicity, we will speak of each variant as a different algorithm. For example, when comparing the standard PCA classifier using the L1 distance to the standard PCA classifier using the L2 distance, the first may be designated algorithm A and the other algorithm B.
Fig. 3.1. Individual identified poorly in the duplicate I probe set. a) raw image, b) normalized.
H1: Algorithm A correctly classifies images more often than does algorithm B.
H0: There is no difference in how well algorithms A and B classify images.

To gain confidence in H1, we need to establish that the probability of H0 is very small given our observation. The mechanics of how this is done appear in standard statistics texts 3; we review the testing procedure in Appendix B. Briefly, based upon the number of times each algorithm is observed to succeed, a normalized variable z is computed. The probability that the null hypothesis is true, P_H0, is determined from z. The z values and probabilities are shown in Table 3.4a for the six pairwise comparisons between the four base distance measures: L1, L2, Angle and Mahalanobis. The numbers of images correctly identified by these four measures are 253, 239, 246 and 305 respectively, out of a total of 722 images. A common cutoff for rejecting H0 is P_H0 < 0.05. Using this cutoff, the only statistically significant differences are between Mahalanobis and the others.

The test of significance just developed is useful, but it fails to take full advantage of our experimental protocol. Specifically, the tests of the different distance measures are not on independently sampled images from some larger population, but are instead on the same set of sampled images. This observation leads us to a second, more discriminating question and associated test. Our hypotheses in this case become:

H1: When algorithms A and B differ on a particular image, algorithm A is more likely to correctly classify that image.

H0: When algorithms A and B differ on a particular image, each is equally likely to classify the image correctly.

To test this hypothesis, we need to record which of four possible outcomes is observed for each probe image:

SS: Both algorithms successfully classify the image.
SF: Algorithm A successfully classifies the image, algorithm B fails.
FS: Algorithm A fails to classify the image correctly, algorithm B succeeds.
FF: Both algorithms fail to classify the image correctly.
The number of times each outcome is observed for all pairs of the four base distance measures is shown in Table 3.4b. The test we use to bound the probability of H0 is called McNemar's Test, which actually simplifies to a Sign Test 5; we review the testing procedure in Appendix B. The probability bounds for each pairwise comparison are shown in the last column of Table 3.4b.
Table 3.4. Results of testing for statistical significance in the comparison of the four base distance measures: a) treating probe images as independent samples for each measure and testing significance over all samples, b) scoring paired tests into four outcomes and performing a Sign Test on only those where performance differs.

(a)
Measures        Variable z   P_H0 <
L1    L2        0.777        0.21848
L1    Angle     0.387        0.34924
L1    Mah.      2.810        0.00247
L2    Angle     0.390        0.34826
L2    Mah.      3.584        0.00017
Mah.  Angle     3.196        0.00070

(b)
Measures        SS    SF    FS    FF    P_H0 <
L1    L2        219   34    20    449   0.038
L1    Angle     214   39    32    437   0.238
Mah.  L1        220   85    33    384   9.2E-07
Angle L2        224   22    15    461   0.162
L2    Mah.      214   25    91    392   2.7E-10
Mah.  Angle     225   80    21    396   1.4E-09
Observe how much more conclusively the null hypothesis can be rejected for those pairs including Mahalanobis distance. Also observe that the L1 versus L2 comparison is now technically significant if a 0.05 cutoff is assumed; in other words, H0 would be rejected with P_H0 < 0.038. What we are seeing is that this test is much more sensitive to differences in the possible outcomes of the two algorithms. Which test is preferable depends on the question being asked. In the first case, the overall difference in performance on all the data is taken into consideration; this is appropriate if the goal is to draw conclusions about performance over all the problems. However, if the goal is to spotlight differences in performance where such differences arise, the second is more appropriate.
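Both procedures are straightforward to reproduce. The sketch below (plain Python, our function names) computes the z statistic of equation (3.10), its Gaussian tail bound (3.11), and the Sign Test bound (3.12); the printed values can be checked against the L1 vs. Mahalanobis rows of Table 3.4:

```python
import math

def two_proportion_z(succ_a, succ_b, n):
    """Normalized variable z of equation (3.10)."""
    pa, pb = succ_a / n, succ_b / n
    pc = (pa + pb) / 2.0
    return (pa - pb) / math.sqrt(2.0 * pc * (1.0 - pc) / n)

def gaussian_tail(z):
    """One-sided P(H0) bound of equation (3.11)."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

def sign_test(sf, fs):
    """McNemar/Sign Test bound of equation (3.12); the larger count is a."""
    a, b = max(sf, fs), min(sf, fs)
    n = a + b
    return sum(math.comb(n, i) for i in range(b + 1)) * 0.5 ** n

# Mahalanobis (305/722) vs. L1 (253/722): z about 2.81,
# and the sign test on the (SF, FS) = (85, 33) outcomes:
print(two_proportion_z(305, 253, 722), sign_test(85, 33))
```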
Fig. 3.2. Eigenvectors 1-10, 50, 100, 150, 200 and 500.
3.4. Selecting Eigenvectors

In the FERET database, images may vary because of differences in illumination, facial expression, clothing*, presence and/or style of glasses, and even small changes in viewpoint, none of which are relevant to the task of identifying the image subject. The problem, of course, is knowing which Eigenvectors correspond to useful information and which are simply meaningless variation. By looking at the images of specific Eigenvectors, it is sometimes possible to determine what features are encoded in that Eigenvector. Images of the Eigenvectors used in the FERET evaluation are shown in Figure 3.2, ordered by Eigenvalue. Eigenvector 1 seems to encode lighting from right to left, while Eigenvector 2 apparently encodes lighting from top to bottom. Since the probe and gallery images of a single subject may not have the same lighting, it is reasonable to assume that removing these Eigenvectors might improve performance. Results from the FERET evaluation 9 verify this assumption. Other Eigenvectors also appear to encode features.
* In the FERET database, background, clothing and hair are eliminated as much as possible during image registration. Unfortunately, shirt collars and a few other effects (such as shadows cast by some hair styles) remain as meaningless sources of variation.
Fig. 3.3. The Energy and Stretching dimensions on the FERET data.
For example, Eigenvector 6 clearly shows glasses. As we examine the higher order Eigenvectors (100, 150, 200, 500), they become more blotchy and it becomes difficult to discern the semantics of what they are encoding. This indicates that eliminating these Eigenvectors from the Eigenspace should have only a minimal effect on performance. In the FERET evaluation, the first 200 Eigenvectors were used (with the L1 distance metric) to achieve optimal performance 9. Removing specific Eigenvectors could in fact improve performance, by removing noise.

3.4.1. Removing the Last Eigenvectors

The traditional motivation for selecting the Eigenvectors with the largest Eigenvalues is that the Eigenvalues represent the amount of variance along a particular Eigenvector. By selecting the Eigenvectors with the largest Eigenvalues, one selects the dimensions along which the gallery images vary the most.
Since the Eigenvectors are ordered high to low by the amount of variance found between images along each Eigenvector, the last Eigenvectors find the smallest amounts of variance. Often the assumption is made that noise is associated with the lower valued Eigenvalues, where smaller amounts of variation are found among the images. There are three variations of deciding how many of the last Eigenvectors to eliminate. The first of these variations removes the last 40% of the Eigenvectors 9; this is a heuristic threshold selected by experience. The second variation uses the minimum number of Eigenvectors needed to guarantee that the energy e is greater than a threshold; a typical threshold is 0.9. The energy e_i of the ith Eigenvector is the ratio of the sum of all Eigenvalues up to and including i over the sum of all the Eigenvalues:

$$e_i = \frac{\sum_{j=1}^{i} \lambda_j}{\sum_{j=1}^{k} \lambda_j} \qquad (3.2)$$

Kirby defines e_i as the energy dimension 6. The third variation depends upon the stretching dimension, also defined by Kirby 6. The stretch s_i for the ith Eigenvector is the ratio of that Eigenvalue over the largest Eigenvalue (λ_1):

$$s_i = \frac{\lambda_i}{\lambda_1} \qquad (3.3)$$
Typically, all Eigenvectors with s_i greater than a threshold are retained; a typical threshold is 0.01. An example of the energy and stretching dimensions of the FERET data can be seen in Figure 3.3. Specifying the cutoff point beyond which Eigenvectors are removed in terms of a percent of the total is sensitive to the gallery size and insensitive to the actual information present in the Principal Components. Either of the other two measures, energy or stretching, ought to provide a more stable basis for assigning the cutoff point.
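Given the sorted Eigenvalues, both cutoff rules are short computations (Python/NumPy; function names are ours):

```python
import numpy as np

def energy_cutoff(eigenvalues, threshold=0.9):
    """Smallest number of Eigenvectors with e_i >= threshold, eq. (3.2)."""
    lam = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]
    e = np.cumsum(lam) / lam.sum()
    return int(np.searchsorted(e, threshold) + 1)

def stretch_cutoff(eigenvalues, threshold=0.01):
    """Number of Eigenvectors with s_i = lambda_i / lambda_1 > threshold,
    eq. (3.3)."""
    lam = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]
    return int((lam / lam[0] > threshold).sum())
```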
3.4.2. Removing the First Eigenvector

It is possible that the first Eigenvectors encode information that is not relevant to the image identification/classification, such as lighting. Another variation therefore removes the first Eigenvector 9.
3.4.3. Eigenvalue Ordered by Like-Image Difference
Ideally, two images of the same person should project to the same point in Eigenspace; any difference between the points is unwanted variation. On the other hand, two images of different subjects should project to points that are as widely separated as possible. To capture this intuition and use it to order Eigenvectors, we define a like-image difference ω_i for each of the m Eigenvectors. To define ω_i, we work with pairs of images of the same people projected into Eigenspace. Let X be images in the gallery and Y images of the corresponding people in the probe set, ordered such that x_j ∈ X and y_j ∈ Y are images of the same person. Define ω_i as follows:

$$\omega_i = \frac{\delta_i}{\lambda_i}, \qquad \text{where } \delta_i = \sum_{j=1}^{m} |x_j - y_j| \qquad (3.4)$$

where the differences are taken along the ith Eigenspace dimension.
When the difference between images that ought to match is large relative to the variance for that dimension, λ_i, then ω_i is large. Conversely, when the difference between images that ought to match is small relative to the variance, ω_i is small. Since our goal is to select Eigenvectors, i.e. dimensions, that bring like images close to each other, we rank the Eigenvectors in order of ascending ω_i. For each of the 1,195 probe/gallery matches in the FB probe set of the original FERET dataset, we calculate ω and reorder the Eigenvectors accordingly. The top N Eigenvectors are selected according to this ranking, and the FB probe set was reevaluated using the L1 distance measure. Figure 3.4 shows the performance scores of the reordered Eigenvectors compared to the performance of the Eigenvectors ordered by Eigenvalue, as performed by Moon & Phillips 9. Reordering by the like-image difference improves performance for small numbers of Eigenvectors (up to the first 40). This suggests that the like-image difference should be used when selecting a small number of Eigenvectors. However, there is a problem with the methodology employed in this test: the test (probe) images were used to develop the like-image difference measure ω. A better experiment would reserve a subset of the test imagery for computing ω, and then record how well the classifier performed on the remaining test images. This improved methodology is used in the next section.
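A sketch of the reordering (Python/NumPy; the array layout is our assumption, with one matched gallery/probe pair per column):

```python
import numpy as np

def like_image_order(gallery_coords, probe_coords, eigenvalues):
    """Rank Eigenvectors by ascending omega_i of equation (3.4).
    gallery_coords/probe_coords: k x m arrays whose column j holds the two
    Eigenspace projections of the same person."""
    delta = np.abs(gallery_coords - probe_coords).sum(axis=1)  # per dimension
    omega = delta / np.asarray(eigenvalues, dtype=float)
    return np.argsort(omega)        # ascending: best dimensions first
```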
Fig. 3.4. Performance when ordering by Eigenvalue versus Like-Image Difference.
3.4.4. Variation Associated with Different Test/Training Sets
All of the results above use a single Eigenspace developed using 500 gallery images. One question to raise is how sensitive the results are to the choice of imagery used to construct the Eigenspace and carry out the testing. Here, we take up this question, and pursue whether the differences in performance between algorithm variations are significant relative to the variation resulting from computing new Eigenspaces based upon the available training images. To allow us to run experiments with alternative assignments of imagery to training and test sets, we restructured the FERET dataset so that there were four images of each individual. The resulting dataset consisted of images of 160 individuals. Four pictures are stored of each individual, where two of the pictures were taken on the same day with different facial expressions (one of the expressions is always neutral). The other two pictures were taken on a different day with the same characteristics.
In this experiment several factors are varied. The standard recognition system is run with each of the four distance measures. Both large and small Eigenspaces are considered: the small space keeping only the first 20 Eigenvectors and the large discarding the last 40% (keeping 127). Both the large and small spaces are created using the standard order, where Eigenvectors are sorted by descending Eigenvalue, and our variant, where Eigenvectors are sorted by like-image difference. Finally, 10 trials are run for each variant using a different random assignment of images to the training (gallery) and test (probe) sets. More precisely, for each of the 160 individuals, two images taken on the same day are randomly selected for training data. Of the 160 individuals, 20 are selected to compute the like-image difference; for these 20, a third image of the individual is also included when computing the like-image difference. For the remaining 140 individuals, one of the remaining two images is selected at random to construct the test image set. In summary, 320 images are used in each trial, 180 for training and 140 for testing. Only the like-image difference algorithm actually uses the images of the 20 individuals set aside for this purpose; the algorithms that sort by Eigenvalue simply ignore these.
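One trial's random assignment might be sketched as follows (Python; the image_table structure is hypothetical, and the image accounting follows the description above):

```python
import random

def trial_split(image_table, seed=0):
    """image_table: hypothetical dict
    id -> {'same_day': [img, img], 'other_day': [img, img]} for 160 ids."""
    rng = random.Random(seed)
    ids = list(image_table)
    rng.shuffle(ids)
    tuning_ids = ids[:20]      # used only for the like-image difference
    test_ids = ids[20:]        # the remaining 140 individuals
    gallery = {i: list(image_table[i]['same_day']) for i in ids}
    tuning_extra = {i: rng.choice(image_table[i]['other_day'])
                    for i in tuning_ids}
    probes = {i: rng.choice(image_table[i]['other_day']) for i in test_ids}
    return gallery, tuning_extra, probes
```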
The number of correctly classified images for each of the variants in this experiment is reported in Table 3.5. Table 3.5a gives results for the case where the last 40% of the Eigenvectors are discarded. This case is subdivided into two parts: in the first, the Eigenvectors are sorted by decreasing Eigenvalue (the standard case), and in the second the Eigenvectors are sorted by decreasing like-image difference as defined above. For both sorting strategies, the standard PCA nearest neighbor classifier is run using each of the four distance measures: L1, L2, Angle and Mahalanobis. Table 3.5b gives the analogous results for the case where fewer Eigenvectors are retained: only the first 20. Let us make three observations based upon the data in Table 3.5.

Observation 1: There is no apparent improvement when Eigenvectors are sorted by the like-image distance measure. In fact, the average number of correctly classified images drops slightly in 7 out of the 8 cases. However, the net differences in these averages are very small, being less than 1 in half the cases, less than 2 in two cases, and less than 4 in the last two.
Table 3.5. Number of correctly classified images, out of 140, for different algorithm variations. Each row gives results for a different random selection of training and test data. a) Discard last 40% of the Eigenvectors. b) Keep only the first 20 Eigenvectors.

(a)
         Standard Order            Sig. Diff. Order
T      L2    L1    A     M       L2    L1    A     M
1      59    69    67    89      63    68    65    85
2      61    70    62    82      60    68    62    83
3      54    76    71    89      59    76    70    86
4      53    73    67    83      61    71    66    84
5      62    71    58    83      54    70    59    84
6      50    72    61    89      64    67    61    79
7      55    75    66    91      63    72    66    85
8      61    67    61    91      53    69    61    83
9      59    71    60    86      56    72    59    79
10     63    73    66    92      61    73    66    86
avg    57.7  71.7  63.9  87.5    59.4  70.6  63.5  83.4

(b)
         Standard Order            Sig. Diff. Order
T      L2    L1    A     M       L2    L1    A     M
1      46    46    52    55      47    45    53    56
2      50    46    50    56      47    44    49    53
3      54    47    58    48      49    43    56    53
4      46    46    46    52      55    47    45    42
5      42    42    51    50      44    44    52    50
6      49    41    55    48      41    42    56    45
7      53    46    44    49      49    46    44    45
8      45    45    50    45      50    44    48    47
9      55    41    49    51      45    41    49    45
10     43    48    48    52      53    49    46    47
avg    48.3  44.8  50.3  50.6    48.0  44.5  49.8  48.3
Observation 2: Recognition rates are significantly higher for the larger Eigenspace. This is consistent with prior studies 9. But the results are compelling, and here we can compare the variations associated with the larger Eigenspace with variations arising out of different sets of gallery and probe images. Looking at the standard sorting case, there are four comparisons possible, one for each distance measure. Using a one sided paired sample t test, P_H0 is less than or equal to 3.0x10^-3, 3.4x10^-10, 2.4x10^-5 and 1.3x10^-8 for the L1, L2, Angle and Mahalanobis distances respectively. In other words, the probability of the null hypothesis is exceedingly low, and it may be rejected with very high confidence.
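Such a one sided paired sample t test is easy to reproduce, for example with SciPy (shown here on the L1 columns of Table 3.5; the exact P values reported above depend on the authors' test details):

```python
from scipy import stats   # SciPy >= 1.6 for the 'alternative' keyword

# L1 columns of Table 3.5, standard order: larger space vs. first 20 only.
large = [69, 70, 76, 73, 71, 72, 75, 67, 71, 73]
small = [46, 46, 47, 46, 42, 41, 46, 45, 41, 48]
t, p = stats.ttest_rel(large, small, alternative='greater')
print(t, p)
```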
Observation 3: Results are much better using Mahalanobis distance in the case of the larger number of Eigenvectors, but not when only the first 20 Eigenvectors are used. When using the top 60% of the Eigenvectors sorted by Eigenvalue, the comparison of Mahalanobis to L1, L2 and Angle using a one sided paired sample t test yields P_H0 less than or equal to 3.3x10^-8, 5.2x10^-7 and 2.2x10^-8. However, using only the first 20 Eigenvectors, the t test comparing Mahalanobis to L1 and Angle yields P_H0 less than or equal to 0.13 and 0.17 respectively. In these two latter cases, the null hypothesis cannot be rejected and no statistically meaningful difference in performance can be inferred. This makes intuitive sense, since the relative differences in Eigenvalues are smaller when only the highest twenty are used, making the Mahalanobis measure more similar to L1 and Angle.

3.5. Conclusion

Using the original FERET testing protocol, a standard PCA classifier did better when using Mahalanobis distance rather than L1, L2 or Angle. In a new set of experiments where the training (gallery) and testing (probe) images were selected at random over 10 trials, Mahalanobis was again superior when 60% of the Eigenvectors were used. However, when only the first 20 Eigenvectors were used, L2, Angle and Mahalanobis were equivalent; L1 did slightly worse. Our efforts to combine distance measures did not result in significant performance improvement. Moreover, the correlation among the L1, L2, Angle and Mahalanobis distance measures, and their shared bias, suggests that although improvements may be possible by combining the L1 measure with other measures, such improvements are likely to be small. We also compared the standard method for selecting a subset of Eigenvectors to one based on like-image similarity. While the like-image method seems like a good idea, it does not perform better in our experiments. More recent work suggests this technique works better than the standard one when used in conjunction with Fisher discriminant analysis, but these results are still preliminary. The work presented here was done primarily by Wendy Yambor as part of her Masters work 17. At Colorado State, we are continuing to study the relative performance of alternative face recognition algorithms. We have two goals. The first is to better understand commonly used algorithms.
The second, larger goal is to develop a more mature statistical methodology for the study of these algorithms and others like them. This more recent work is being supported by the DARPA Human Identification at a Distance Program. As part of this project, we are developing a web site intended to serve as a general resource for researchers wishing to compare new algorithms to standard algorithms previously published in the literature. This web site may be found at: http://www.cs.colostate.edu/evalfacerec/.

Appendix A: Distance Measures

The L1, L2, Angle and Mahalanobis distances are defined as follows:

L1 City Block Distance

$$d(x, y) = |x - y| = \sum_{i=1}^{k} |x_i - y_i| \qquad (3.5)$$
L2 Euclidean Distance (Squared)

$$d(x, y) = \|x - y\|^2 = \sum_{i=1}^{k} (x_i - y_i)^2 \qquad (3.6)$$

Angle Negative Angle Between Image Vectors

$$d(x, y) = \frac{-\sum_{i=1}^{k} x_i y_i}{\sqrt{\sum_{i=1}^{k} x_i^2 \sum_{i=1}^{k} y_i^2}} \qquad (3.7)$$

Mahalanobis Mahalanobis Distance

$$d(x, y) = -\sum_{i=1}^{k} \frac{1}{\sqrt{\lambda_i}} x_i y_i \qquad (3.8)$$
where λ_i is the ith Eigenvalue corresponding to the ith Eigenvector. This is a simplification of Moon's definition:

$$d(x, y) = -\sum_{i=1}^{k} z_i x_i y_i \qquad (3.9)$$

where z_i is a weighting term computed from √(λ_i + α²) with α = 0.25.
Our original experiments with the definition in ref. 9 yielded poor results, hence our adoption of the definition in equation (3.8).
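For reference, the four measures translate directly into NumPy (function names are ours):

```python
import numpy as np

def l1(x, y):                        # equation (3.5)
    return np.abs(x - y).sum()

def l2(x, y):                        # equation (3.6)
    return ((x - y) ** 2).sum()

def angle(x, y):                     # equation (3.7)
    return -(x @ y) / np.sqrt((x @ x) * (y @ y))

def mahalanobis(x, y, eigenvalues): # equation (3.8)
    return -((x * y) / np.sqrt(eigenvalues)).sum()
```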
Appendix B: Statistical Tests

Large Sample Inference Concerning Two Population Proportions

Assume the probe images are drawn from a population of possible images. Let π be the ratio of solvable images over the total number of images in this population. The observed proportion p of problems solved on the sample (probe) images is an estimate of π. When comparing results using two algorithms A and B, the null hypothesis H0 is that π_A = π_B. Following the development in the section "Large-Sample Inferences Concerning A Difference Between Two Population Proportions" in ref. 3, the probability of H0 is determined using a standardized variable z:
,
2p,(l-Pc)
PA+PB
where pc =
,,
n
_.
(3.10)
Here p_A and p_B are the observed proportions of successes on the sample (probe) images for algorithms A and B, and n is the total number of images. The standardized variable is Gaussian with zero mean and standard deviation one. When performing a one sided test, i.e. testing for the case π_A > π_B, the probability of H0 is bounded by:
Paired
Success/Failure
r°° i _** < / -=e—dx
Trials: McNemar's
(3.11)
Test
McNemar's test ignores those outcomes where the algorithms do the same thing: either SS or FF. For the remaining outcomes, SF and FS, a Sign Test is used. The null hypothesis HO is that the probability of observing SF is equal to that of observing FS is equal to 0.5. Let a be the number of times SF is observed and b the number of times FS is observed. We are interested in the one sided version of this test, so order our choice of algorithms so a > b and assume HI is that algorithm A fails less often then B. Now, the probability of the null hypothesis is bounded by b
Pm < V
,
.,, n " . N ,0.5 n where n = a + b
(3.12)
Analyzing PCA-based Face Recognition Algorithms
59
Acknowledgments

We thank the National Institute of Standards and Technology for providing us with the results and images from the FERET evaluation. We thank Geof Givens from the Statistics Department at Colorado State University for his insight in pointing us toward McNemar's Test. This work was supported in part through a grant from the DARPA Human Identification at a Distance Program, contract DABT63-00-1-0007.
References
1. P. Belhumeur, J. Hespanha and D. Kriegman. Eigenfaces vs. fisherfaces: Recognition using class specific linear projection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(7):711-720, 1997.
2. L. Breiman. Bagging predictors. Technical Report 421, Dept. of Statistics, University of California, Berkeley, 1994.
3. J. Devore and R. Peck. Statistics: The Exploration and Analysis of Data, Third Edition. Brooks Cole, 1997.
4. T. Dietterich and G. Bakiri. Solving multiclass learning problems via error-correcting output codes. Journal of Artificial Intelligence Research, 2:263-286, 1995.
5. IFA. Statistical tests. http://fonsg3.let.uva.nl:8001/service/statistics.html. Website, 2000.
6. M. Kirby. Dimensionality Reduction and Pattern Analysis: an Empirical Approach. Wiley (in press), 2000.
7. E. B. Kong and T. Dietterich. Why error-correcting output coding works with decision trees. Technical report, Dept. of Computer Science, Oregon State University, Corvallis, 1995.
8. M. Kirby and L. Sirovich. Application of the Karhunen-Loeve procedure for the characterization of human faces. IEEE Transactions on Pattern Analysis and Machine Intelligence, 12(1):103-107, January 1990.
9. H. Moon and J. Phillips. Analysis of PCA-based face recognition algorithms. In K. Boyer and J. Phillips, editors, Empirical Evaluation Techniques in Computer Vision. IEEE Computer Society Press, 1998.
10. J. Phillips, H. Moon, S. Rizvi and P. Rauss. The FERET evaluation. In H. Wechsler, J. Phillips, V. Bruce, F. Soulie and T. Huang, editors, Face Recognition: From Theory to Applications. Springer-Verlag, Berlin, 1998.
11. J. Phillips, H. Moon, S. Rizvi and P. Rauss. The FERET evaluation methodology for face-recognition algorithms. Technical Report 6264, NIST, 1999.
12. W. H. Press, B. P. Flannery, S. A. Teukolsky and W. T. Vetterling. Numerical Recipes in C. Cambridge University Press, Cambridge, 1988.
13. S. K. Nayar, S. A. Nene and H. Murase. Real-time 100 object recognition system. In Proceedings of ARPA Image Understanding Workshop. Morgan Kaufmann, 1996. http://www.cs.columbia.edu/CAVE/rtsensors-systems.html.
14. L. Sirovich and M. Kirby. A low-dimensional procedure for the characterization of human faces. The Journal of the Optical Society of America, 4:519-524, 1987.
15. D. Swets and J. Weng. Using discriminant eigenfeatures for image retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(8):831-836, 1996.
16. D. Swets and J. Weng. Hierarchical discriminant analysis for image retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence, 21(5):386-401, 1999.
17. W. S. Yambor. Analysis of PCA-Based and Fisher Discriminant-Based Image Recognition Algorithms. M.S. Thesis, Colorado State University, July 2000 (Technical Report CS-00-103, Computer Science).
CHAPTER 4
Design of a Visual System for Detecting Natural Events by the Use of an Independent Visual Estimate: A Human Fall Detector
P. A. Bromiley, P. Courtney and N.A. Thacker Imaging Science and Biomedical Engineering, Stopford Building, University of Manchester, Oxford Road, Manchester, M13 9PT E-mail:
[email protected]
We describe the development of a vision system to detect natural events in a low-resolution image stream. The work involved the assessment of algorithmic design decisions to maximise detection reliability. This assessment was carried out by comparing measures and estimates made by the system under development with measures obtained independently. We show that even when these independent measures are themselves noisy, their independence can serve to guide rational design decisions and allow performance estimates to be made. Although presented here for one particular system design, we believe that such an approach will be applicable to other situations when an image-based system is to be used in the analysis of natural scenes, in the absence of a precise ground truth.
4.1. Introduction

Performance evaluation is essential for providing a solid scientific basis for machine vision, and yet its importance is often understated. Current work in this area 1,2,3 has tended to emphasise the importance of an objective ground truth (see for example work in medical image registration 4, face recognition 5, photogrammetry 6 and graphics recognition 7). We present a case study of the evaluation of a machine vision system, which exhibits two examples of performance characterisation in the absence of a suitable ground truth. We believe that our approach will be applicable to other situations when an image-based system is to be used in the analysis of natural scenes, operating in the absence of a precise ground truth.
The system described here is a human fall detector based on a novel infrared sensor with limited spatial resolution. It was developed by an industrial/academic collaboration involving the University of Manchester, the University of Liverpool, IRISYS Ltd. and British Telecom Plc in the UK. It observed a natural scene of a person in their home, typically an elderly person living alone. When left undetected, falls amongst the elderly often lead to aggravated injuries, admission to hospital, and sometimes death. However, this kind of event is rather poorly defined and occurs randomly under normal circumstances, making the design of a system to detect it problematic. In order to be practically useful the system had to detect falls reliably whilst generating the minimum number of false alarms, making performance evaluation a vital part of this work.

The detector operated by recognising the patterns of vertical velocities present during a fall by a human subject, and issuing a fall detection warning when such a pattern was detected in the output from the thermal sensor. The initial identification and the detection of these patterns were performed automatically using an MLP neural network, thus removing any requirement for in-depth study of the dynamics of human movement. Image data showing simulated fall and non-fall (e.g. sitting) scenarios were captured from both the infrared sensor and a colour CCD camera. Vertical velocity measurements were extracted from both sets of images, and the colour data were used as a gold standard to demonstrate a correlation between the infrared velocity estimates and the physical velocities present in the scene. Despite the fact that the velocity measurements extracted from the colour data did not represent a genuine ground truth, we show that this form of performance evaluation can be used to guide algorithmic development in an objective manner. Finally, a range of neural networks were trained on a sub-set of the infrared data and tested on the remainder, and the best-performing network was identified using ROC curves. We demonstrate that, using appropriate definitions of true and false detections, it is possible to evaluate the performance of a system for identifying events in a temporal data stream in this manner.

4.2. Approach

The purpose of the fall detector was to monitor the image stream from the thermal detector for characteristic signals associated with falls in human subjects, and to issue a fall detection warning when such a motion was
observed. At the most basic level, the primary characteristic associated with falls is vertical downwards motion. Therefore the analysis focussed on measuring vertical velocities, and identifying features in the pattern of velocities over time that were characteristic of falls. Basic physics guarantees that, if the velocity of a subject moving around a room is resolved into its horizontal and vertical components, the vertical acceleration of a subject falling under gravity acts independently of any horizontal motion. Therefore, the analysis was restricted to studying vertical motion, and any horizontal component was discarded.

The first requirement was to obtain realistic image data from which the characteristic motions associated with falls could be identified. Sequences of images showing an actress simulating a wide variety of fall and non-fall (e.g. sitting) scenarios were captured simultaneously from both the infrared sensor and a colour CCD video camera. Next, software was written to calculate the velocity of the actress both from the infrared image sequences and from the colour video.

The approach taken to extracting velocity information from the colour video relied on the use of colour segmentation. During the simulations, the actress wore a shirt of a different colour to any other object in the scene. A colour segmentation algorithm was used to extract the shirt region from the colour video images, allowing the centroid of the actress's upper body to be calculated. The vector of these centroid positions over time could then be differentiated to obtain velocity measurements from the colour images.

The approach taken to extracting velocity information from the infrared images was quite different, and exploited the basic physics of the detector itself. The infrared sensor used was a differential sensor, i.e. it registered changes in temperature. Stationary objects in the scene were therefore ignored. Any moving object warmer than the background created two significant regions in the image. The first was a region of positive values covering the pixels that the object was moving into, which were becoming warmer. This was trailed by a region of negative values covering the pixels that the object was moving out of, which were becoming colder. If the object was colder than the background, the signs of the two regions were reversed. In either case, a zero-crossing existed between the two regions that followed the trailing edge of the moving object. The approach taken was to track this zero-crossing to give measurements of the position of the actress in the infrared images, which could in turn be used to calculate the
velocity. The velocity estimates derived from the colour video data were used as a gold standard against which to compare the velocities calculated from the infrared images. It was therefore possible to calculate the extent to which the infrared velocity estimates were correlated with the physical velocities present in the scene. This correlation analysis was performed for a subset of around 2% of the data. Several methods were used to perform this analysis, including linear correlation coefficients, Gaussian fitting to the noise on the infrared measurements after outlier rejection, and the use of ROC curves. The latter provided a generic method for ranking the performance of various smoothing filters applied to the data.

The use of two independent estimators of motion was a key part of this work. Although both used radiation from the scene, they were independent in that they sensed different kinds of information, and processed it in fundamentally different ways. The colour segmentation algorithm was based on the interaction of visible light and reflectance on clothing in spatial regions, whereas the algorithm for the infrared sensor was based on signal zero-crossings on thermal radiation using a completely different lens system. We would therefore expect the information in the two estimators to suffer from noise, bias and distortion independently.

To produce the fall detector itself, an MLP neural network was trained to take temporal windows of velocity measurements from the infrared sensor and produce a fall/non-fall decision. The advantage of this approach was that the training process allowed the neural network to automatically identify the regions in the input pattern space that contained the fall data points, i.e. the patterns of velocities characteristic of a fall. Therefore no in-depth, manual study of those patterns of velocities was required. A subset of the infrared velocities was extracted at random and used to train a neural network to perform the fall/non-fall classification. A number of neural networks were trained, varying all available parameters in order to find the best architecture for this problem. ROC curves were used to select the best network.

A nearest neighbour classifier was applied to the same data. It can be shown that the nearest neighbour classifier approaches Bayes optimal classification performance for large data sets such as this, and so this gave an indication of how closely the best-performing neural network approached optimum performance. Finally, the classification decision was made
based on individual velocity measurements, without using the neural network, in order to estimate how much of a performance gain had been realised through the application of a neural network to this temporal recognition problem.

4.3. Data Collection

Data were collected to provide realistic video data on both fall and non-fall scenarios that could then be used in the construction of a fall detector. A list of types of fall (e.g. slips, trips etc.), together with non-fall scenarios that might generate significant vertical velocities (e.g. sitting), was prepared. An actress was employed to perform the scenarios and thus simulate falls and non-falls, and sequences of images were captured from both the infrared sensor and a colour CCD video camera. Each scenario was performed in six orientations of the subject and the cameras, giving a total of 84 fall and 26 non-fall sequences for each camera. In addition to performing the fall or non-fall, the actress raised and lowered her arm at the beginning and end of each sequence in order to provide a recognisable signal that could later be used to synchronise the colour and infrared image sequences. In order to simplify the simulations, they were performed at between four and six metres from the cameras, with the cameras held parallel to the floor in the room. This maximised the ability to distinguish vertical from horizontal motion. Twenty degree optics were used on both the colour and infrared cameras, further simplifying the geometry of the scene through the foreshortening effect of narrow angle optics. The infrared data were recorded at 30fps in a proprietary format. Colour data were recorded at 15fps as RGB values (i.e. no codec was used) in the AVI format. Interpolation was later used to increase the effective frame rate of the colour data to match the infrared data.

4.4. Velocity Estimation

The extraction of velocity information from the images recorded during the simulations focussed on three areas:

• estimating velocities from the colour video images;
• estimating velocities from the differential infrared images;
• measuring the correlation between them.
Fig. 4.1. A frame of colour video.
The following sections give brief summaries of the methods adopted in each of these areas.

4.4.1. Colour Segmentation and Velocity Estimation
The extraction of velocities from the colour images relied on the use of a colour segmentation routine 8. Fig. 4.1 shows an example of a single frame of colour video taken from one of the falls. During the simulations, the actress wore a shirt of a different colour to any other object in the scene. This allowed the colour segmentation routine to extract and label the pixels in the shirt region. Following this, the co-ordinates of the pixels in that region were averaged to calculate the centroid of the actress's upper body, giving a measurement of her position in the scene. The vector of these measurements over the course of an entire sequence of images could then be differentiated to produce a vector of velocities. Only the vertical component of the velocity was calculated.

The image segmentation algorithm is described in more detail elsewhere 8, but a brief description is included here for completeness. The approach adopted relied on the clustering of pixels in feature space. The subject of clustering, or unsupervised learning, has received considerable attention in the past 9, and the clustering technique used here was not original. However, much of the work in this area has focussed on the determination of suitable criteria for defining the "correct" clustering. In this work a statistically motivated approach to this question was adopted, defining the size required for a peak in feature space to be considered an independent cluster in terms of the noise in the underlying image. This maximised the information extracted from the images without introducing artefacts due to noise, and also defined an optimal clustering without the need for testing a range of different clusterings with other, more subjective criteria.

The segmentation process worked by mapping the pixels from the original images into an n-dimensional grey-level space, where n was the number of images used, and defining a density function in that space. A colour image can be represented as three greyscale images, showing for instance the red, green and blue components of the image, although many alternative three-dimensional schemes have been proposed 10,11. Therefore a colour image will generate a three-dimensional grey-level space, although the algorithm can work with an arbitrary number of dimensions. An image showing a number of well-defined, distinct colours will generate a number of compact and separate peaks in the grey-level space, each centered on the coordinates given by the red, green and blue values for one of the colours. The algorithm then used the troughs between these peaks as decision boundaries, thus classifying each pixel in the image as belonging to one of the peaks. Each peak was given a label relating to the number of pixels assigned to it, and an image of these labels was generated as output.

In practice, illumination effects can spread out or even divide the peaks in colour space. For example, an object of a single colour may be partially in direct light and partially in shadow, and so the shadowed and directly lit regions of the object appear to have different colours. In the RGB colour scheme each component contains both chromatic and achromatic components. The object will therefore generate two peaks in colour space and be segmented as two separate regions. This is not a failure of the algorithm, since the separation of the object into several regions preserves the information present in the image, but it was undesirable in the current situation where the intention was to label the shirt as a single region regardless of illumination effects. Therefore the achromatic information was removed from the images prior to segmentation, by converting from the RGB colour space to the HSI colour space, which separates the chromatic information in the hue and saturation fields from the achromatic information in the intensity field. The intensity field was discarded and the segmentation was performed on the hue and saturation fields. This had the additional advantage of reducing the dimensionality of the problem from three to two, reducing the processor time required. Fig. 4.2 shows the hue and saturation fields for the frame of colour video shown in Fig. 4.1, and Fig. 4.3 shows the scattergram of these two fields.

Fig. 4.2. A frame of colour video - the saturation (left) and hue (right) fields.

Fig. 4.3. The scattergram for the hue and saturation fields shown in Fig. 4.2, corresponding to the colour video frame shown in Fig. 4.1. The peak in the upper-right corresponds to the green shirt.

The pixels corresponding to the shirt form a well-defined, compact cluster in hue-saturation space. Since the actress wore a shirt of a single, uniform colour, it was labelled as a single region. A thresholding algorithm could then be used to identify the pixels covering the shirt region, and the centroid of the shirt was calculated by averaging the pixel co-ordinates. Fig. 4.4 shows the outputs from the colour segmentation algorithm. The vertical component of the centroid's velocity was then calculated by taking differences between the vertical position in neighbouring frames, thus producing a velocity in units of pixels per frame. The frame rate of the colour video was 15fps, whereas the frame rate of the infrared data was 30fps. Since the colour video provided much more accurate data, and to avoid discarding data, extra data points were generated for the colour video by interpolating the centroid positions between each pair of frames before the velocity calculation. Finally, since the colour segmentation algorithm was very processor intensive, the velocity calculation procedure was only applied to a subset of 26 of the fall video sequences from the simulations, and to a window of 30 frames centered around the fall in each video. This generated 59 positional data points for each video sequence when the interpolation was applied, and 58 velocity data points.
Fig. 4.4. Segmentation of a frame of colour video (left), and the result of thresholding to extract the shirt region (right).
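A minimal sketch of the centroid and velocity computation described above (our own reconstruction, not the original software; `mask` is assumed to be the boolean shirt segmentation of one frame):

```python
import numpy as np

def shirt_centroid_row(mask):
    # mask: boolean image, True on pixels labelled as shirt by the segmenter.
    rows, _ = np.nonzero(mask)
    return rows.mean()            # vertical centroid in pixels

def vertical_velocities(rows_15fps):
    # Interpolate the 15 fps centroid track to 30 fps by inserting midpoints,
    # then difference neighbouring frames -> velocity in pixels per frame.
    r = np.asarray(rows_15fps, dtype=float)
    doubled = np.empty(2 * len(r) - 1)
    doubled[0::2] = r
    doubled[1::2] = (r[:-1] + r[1:]) / 2.0
    return np.diff(doubled)
```

For a 30-frame window this yields 59 interpolated positions and 58 velocities, matching the counts quoted above.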
4.4.2. IR Velocity Estimation
The approach taken with the infrared images relied on the differential nature of the detector. Since it was sensitive to changes in temperature, only moving objects at a different temperature to the background were detected. Therefore, the analogue of the colour segmentation task was performed by the detector itself. As mentioned above, any moving object at a higher temperature than the background generated two regions in the images: a region of positive values covering the pixels into which the object was moving, trailed by a region of negative values covering the pixels out of which the object was moving. The zero-crossing between these two regions followed the trailing edge of the object.

In order to remove any horizontal component of the movement, the 16x16 pixel infrared images were summed across rasters to produce a 16x1 pixel column vector. This introduced several additional advantages, providing an extremely compact method for storing the image information (the "crushed" image, produced by stacking together the column vectors from an image sequence) and reducing the effects of noise. The positions of the zero-crossings were identified initially by a simple search up the column vectors, and were refined by linear interpolation. This produced a vector of approximations to the vertical position of the actress across a sequence of images, which was differentiated to produce a vector of vertical velocities. Fig. 4.5(a) shows a single frame from the IR video, taken simultaneously with the colour video frame shown in Fig. 4.1. Fig. 4.5(b) shows the result of stacking together the column vectors for all of the IR frames taken during this fall, the "crushed" image. Fig. 4.5(c) shows the region of the crushed image covering the fall itself, with the zero-crossings marked as white points.

Fig. 4.5. A single frame of IR video (a), showing the positive (white) and negative (black) regions. The crushed image for this fall sequence (b) shows the extended period of activity in the middle of the sequence corresponding to the fall itself, together with regions of activity before the fall, corresponding to the actress waving her arm as a synchronisation signal, and after the fall, corresponding to the actress standing up. The region from the middle of the crushed image is shown in detail in (c), with the zero-crossings marked as white points.
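The crushing and zero-crossing search can be sketched in a few lines (an illustration under our own naming, not the original code; `ir_frames` is assumed to be the stack of 16x16 differential frames):

```python
import numpy as np

def crushed_image(ir_frames):
    # ir_frames: (n_frames, 16, 16) differential IR sequence. Summing across
    # rasters removes horizontal motion and leaves one 16x1 column vector
    # per frame; stacking the columns gives the "crushed" image.
    return np.stack([f.sum(axis=1) for f in ir_frames], axis=1)

def zero_crossing_rows(crushed):
    # For each frame (column), search up the column for a sign change and
    # refine the crossing position by linear interpolation.
    positions = []
    for col in crushed.T:
        pos = np.nan
        for i in range(len(col) - 1):
            if col[i] * col[i + 1] < 0:        # sign change between i, i+1
                pos = i + col[i] / (col[i] - col[i + 1])
                break
        positions.append(pos)
    return np.array(positions)  # differentiate to obtain vertical velocities
```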
4.4.3. Velocity Correlation
The extent of the correlation between the velocities extracted from the colour images and those extracted from the infrared images was measured using a number of different techniques. The simulation data consisted of 108 data sets, covering a variety of both fall and non-fall scenarios, recorded on both a colour CCD camera and the infrared sensor. In order to test the velocity calculation routines a sub-set of 26 of these data sets, covering the whole range of fall scenarios, were selected and plotted against one another, as shown in Fig. 4.6. The units of velocity in this plot are pixels per frame, where a frame corresponds to 1/30th of a second, but it must be remembered that there are 240 pixels along the y-axis of the colour video images, and only 16 along the y-axis of the infrared images. As can be seen, there is some correlation, but it is somewhat poor. This was expected given the exacerbation of noise in the position data by the differentiation used to calculate the velocity. Therefore methods of combining data within the temporal stream were studied.
Fig. 4.6. The raw velocity data, showing infrared velocity against colour video velocity. The units of both axes are pixels per frame, where one frame corresponds to 1/30th of a second, but the definition of a pixel is different for the two axes.
4.4.4. Data Combination

Smoothing functions were applied in order to improve the correlation. A variety of smoothing techniques were tested in order to find the best technique for this data set. These included:

• a five-point moving window average, which replaced each data point with the result of averaging the five points centered around that data point;
• a five-point median filter, which took the same window of data and replaced the central data point with the median of the five values;
• a combination of the two, termed a median rolling average (MRA) filter, which took a window of five data points centered on the point of interest, dropped the highest and lowest values, and then averaged the remaining three points.

Each of these techniques had advantages. The median filter tended to be more robust to outliers, whereas the moving window average was very susceptible to outliers but had a stronger smoothing effect in their absence. The MRA filter combined the advantages of each, providing stronger smoothing whilst retaining resistance to outliers.
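The three filters are easy to state precisely in code; a sketch follows (our own, operating on a 1-D NumPy array of velocities):

```python
import numpy as np

def moving_average5(v):
    # Five-point moving window average.
    return np.convolve(v, np.ones(5) / 5.0, mode="valid")

def median5(v):
    # Five-point median filter.
    return np.array([np.median(v[i:i + 5]) for i in range(len(v) - 4)])

def median_rolling_average5(v):
    # MRA: drop the min and max of each five-point window, average the rest.
    out = []
    for i in range(len(v) - 4):
        w = np.sort(v[i:i + 5])
        out.append(w[1:4].mean())
    return np.array(out)
```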
Fig. 4.7. The velocity data after 5 point moving window average, showing infrared velocity against colour video velocity.
Fig. 4.8. The velocity data after 5 point median average, showing infrared velocity against colour video velocity.
Figs. 4.7-4.9 show the result of plotting the processed estimates from the thermal sensor against the velocity estimates derived from the colour processing.
Fig. 4.9. The velocity data after 5 point median / 3 point moving window average, showing infrared velocity against colour video velocity.
Several facts were immediately obvious from these plots. It is clear that the smoothed data produced tighter distributions than the unsmoothed data, but none of the three smoothing methods had an obvious advantage from simple visual inspection of the plots. The plots for the three smoothing methods had a typical shape: the velocity estimates from the infrared images were directly proportional to those from the colour video up to velocities of approximately 2 pixels per frame on the x-axis of the plots (corresponding to the colour video velocity estimates), and then they flattened out. The infrared velocity at which this occurred was around 0.19 pixels, approximately the value that would be expected from the ratio of pixel sizes in the two image types. Above this point, the infrared velocity estimates no longer increased with colour video velocity estimate. This was due to the basic physics of the detector itself, rather than the methods used to extract velocity estimates from the data. The thermal sensor took several seconds to saturate, and this placed an upper limit on the rate of temperature change it could detect. The colour video velocity estimates had a maximum velocity of around 6 pixels per frame, corresponding to around 2.25 metres per second. The infrared velocity estimates flattened at around one third of this value, and so the maximum velocity measurable for the given geometry was 0.75 metres per second. Therefore, the sensor design discarded approximately two thirds of the dynamic range of velocities
present during typical fall scenarios. Following consultation with the sensor manufacturer, it emerged that this may have been due to the preprocessing performed by the sensor itself, which has since been modified.

In order to determine which smoothing method was the most effective for this data, the correlations between the infrared and colour video velocity estimates were measured in several ways. Firstly, a simple linear correlation coefficient was calculated for the whole data set for each smoothing option, and the results are given in Table 4.1. This was not in itself particularly informative, since it assumed a simple linear relationship between the two variables studied, and furthermore was highly sensitive to outliers. The correlation coefficients showed that smoothing the data gave a better correlation than no smoothing, but did not show a significant difference between the three smoothing methods.

Therefore, a more complex technique was applied. As mentioned above, the infrared velocity data reached a plateau above a value of around 2 pixels per frame on the x-axis (colour video velocity estimate) and 0.19 pixels per frame on the y-axis (infrared velocity estimate), and did not increase further with colour velocity. Any difference in the infrared velocity for data points above this threshold therefore corresponded only to noise. The data from this region were projected onto the y-axis (i.e. the x-coordinates were discarded) and the mean and standard deviation were calculated, assuming a normal distribution. Since these calculations were also sensitive to outliers, the outlying data points were identified by simple visual inspection and discarded. The grey points on the graphs show the data points selected for these calculations; the black points are the discarded data points. The number of inliers (grey points) and outliers (black points above the 2 pixels per frame threshold) were also counted. The results are again given in Table 4.1.

Table 4.1. Statistics for the four smoothing methods used.

Smoothing method   Correlation   Inliers   Outliers   Mean    Std. dev.
None               0.176         178       133        0.178   0.129
5 pt Average       0.201         239       72         0.193   0.098
5 pt Median        0.204         228       83         0.168   0.110
M.R.A.             0.199         239       72         0.212   0.098
Inspection of the data given in Table 4.1 shows that the median rolling average filter gave the best performance of the four smoothing options (three smoothing methods plus no smoothing). It had the highest inlier/outlier ratio and lowest standard deviation (although the difference between the median rolling average and moving average was largely insignificant), showing that it produced a tighter distribution and had the largest noise-reducing effect of the three methods. It also produced the highest mean, and so gave more separation of fall data points from the noise.

In order to conclusively demonstrate the superiority of the median rolling average filter over the other filters tested for this data set, ROC curves were plotted for the four smoothing options. A threshold was specified on the colour velocities to arbitrarily split the data into "fast" and "slow" velocities. This threshold was set at 2 pixels per frame, the point at which the infrared velocity estimates stopped increasing with colour velocity estimate, and therefore the threshold which gave the maximum ability to differentiate high velocities from noise. Then a second, varying threshold was applied to the infrared velocity estimates, and was used to make a fast/slow decision based on the infrared data. The points above this second threshold therefore fell into two groups. Firstly, those defined as fast using the colour data, i.e. those with x values higher than 2 pixels per frame, represented correct decisions based on the infrared data. Secondly, those defined as slow using the colour data represented incorrect decisions. The number of data points in each of these categories was counted, and a percentage was calculated by dividing by the total number of points either above or below the threshold applied to the colour data. The two probabilities calculated in this way produced a point on the ROC curve, and by varying the threshold applied to the infrared velocity the whole ROC curve was plotted for each smoothing option.

The ROC curves therefore provided a generic method for calculating the correlations between the colour and infrared velocity estimates, by measuring how often they agree or disagree on a data point being fast for some arbitrary definitions of "fast" and "slow". The resulting graph is shown in Fig. 4.10. The best smoothing option was the one whose ROC curve most closely approached the point (100,0) on the graphs, and it is clear that the MRA filter was the best for this data set. Applying no smoothing was shown, as expected, to be the worst choice. The median rolling average was always
better than the simple average, due to its ability to reject outliers. At very low correct decision rates the median average proved better than the median rolling average due to its superior resistance to outliers. However, the median rolling average was clearly better than the other smoothing choices in the regime in which the final classifier would operate.

Fig. 4.10. ROC curves (showing percentage false acceptance rate plotted against percentage true acceptance rate) for the four velocity calculation methods (a) and a detail from this plot (b).
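A sketch of how such a curve can be traced (our own minimal reconstruction, with invented names: `colour_v` and `ir_v` are paired velocity estimates, and `fast_thresh` is the 2 pixels per frame colour threshold):

```python
import numpy as np

def roc_curve_points(colour_v, ir_v, fast_thresh=2.0, n_steps=100):
    # "Fast" ground truth comes from the colour estimates; a varying
    # threshold on the infrared estimates makes the fast/slow decision.
    is_fast = np.asarray(colour_v) > fast_thresh
    ir_v = np.asarray(ir_v)
    points = []
    for t in np.linspace(ir_v.min(), ir_v.max(), n_steps):
        detected = ir_v > t
        tpr = np.mean(detected[is_fast]) * 100    # true acceptance rate (%)
        fpr = np.mean(detected[~is_fast]) * 100   # false acceptance rate (%)
        points.append((fpr, tpr))
    return points
```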
4.4.5. Conclusions
The statistical analysis of the results from the four smoothing options clearly showed that the median rolling average was the best smoothing filter for this data set. Several other conclusions can also be drawn from the data presented. Firstly, the time taken for the detector to saturate placed an upper limit of around 0.2 pixels per frame, or 6 pixels per second, on the dynamic range of velocities that could be detected. Given this information, the effects of changes to the detector design can be deduced. Fitting a Gaussian to the infrared velocity measurements in the flat part of the velocity curves produced a mean of approximately 0.2 pixels per frame and a standard deviation of around 0.1. It is therefore clear that, in order to separate the velocities measured during a fall from the noise on measurements of zero velocity, a resolution of around 0.1 pixels per frame is required, and furthermore that this was being achieved by the current method.

It is probable that the correlation between the velocities calculated from the infrared images and the physical velocities present in the scene was better than the correlation between the velocities calculated from the infrared and colour video. The two velocity extraction techniques measured slightly different quantities. In the case of the colour video, the actress's upper body was segmented from the scene and its centroid calculated. The limbs and head were ignored, and so the calculated velocities corresponded closely to the movement of the centre of gravity. In contrast, the infrared images measured the movements of all body parts. As an example, if the actress waved her arms whilst otherwise standing still this movement was detected in the infrared images but not in the colour video.

The aim of this work was to demonstrate the feasibility of extracting velocity estimates from the data provided by the thermal sensor, and the analysis presented here placed a lower limit on the correlation between the velocity estimates from the infrared data and the physical velocities present in the scene. This approach was intended to give an overall ranking to the various velocity estimation algorithms tested, rather than to calculate an absolute fall detection efficiency measure.
4.5. Neural Network Fall Detector

Once the method for extracting estimates of velocity from the infrared images had been produced, and a correlation with the physical velocities present in the scene had been demonstrated, the next stage was the construction of a fall detector that took the infrared velocity estimates as input and produced a fall/non-fall decision as output. Neural networks represent a well-established computational technique for making classification decisions on data sets that can be divided into two or more classes. In essence a neural network is a method for encoding a set of decision boundaries in an arbitrarily high-dimensional space. The network typically takes some high dimensional vector as input, and produces a classification of the data as output, based on the position of the vector in the space compared to the positions of the decision boundaries. The main advantage of the technique is that the positions of the decision boundaries can be determined automatically during the training phase. Data is provided in which the classification is known, and a variety of algorithms exist that optimise the decision boundaries. The trained network can then be used to classify data points for which the classification is not previously known.

In this case, the input data for the network was high-dimensional vectors of velocity measurements representing temporal windows. Each temporal window of velocity measurements defined a point in the high-dimensional space in which the neural network was operating. The network then attempted to define decision boundaries which encompassed the region of the space containing the points corresponding to falls, thus identifying the characteristic patterns of velocities present during falls. The dimensionality of the input vectors, and thus the number of input nodes in the network, was selected on the basis of the timespan of the falls recorded during the simulations: temporal windows of nine velocity measurements were used.
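As an illustration only, scikit-learn's MLPClassifier can stand in for the original MLP implementation; the window length and the 20%/80% split follow the text, everything else here is our own sketch:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

def make_windows(velocities, labels, width=9):
    # Nine-point temporal windows of IR velocity, one label per window centre.
    X = np.array([velocities[i:i + width]
                  for i in range(len(velocities) - width + 1)])
    y = np.array(labels[width // 2 : len(velocities) - width // 2])
    return X, y

# 20% of the windows would be drawn at random for training, 80% for testing;
# e.g. net = MLPClassifier(hidden_layer_sizes=(18,), max_iter=2000).fit(X_tr, y_tr)
```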
4.5.1. Data Preparation, Network Design, and Training
In order to provide known classifications for the data points used in the neural network training, some method for labelling the positions of the falls in the data was required. Therefore, the infrared images were synchronised with the colour images, and the approximate positions of the falls were identified by visual comparison. Then the three points of highest velocity
during each fall were labelled as falls, and the remaining data points were labelled as non-falls. This provided approximately 50,000 classified data points, of which 20% were extracted at random for neural network training, leaving 80% for testing the networks. A variety of neural networks were trained, in an attempt to find the optimum architecture for this problem, varying all available parameters including:

• the number of hidden nodes, varied between 2 and 21;
• the number of hidden layers (1-2);
• the initialisation factor for the network weights;
• the training algorithm - either RPROP (resilient back-propagation) or CGM (conjugate gradient minimisation);
• the number of iterations, from 200 to 2000.
A total of 120 combinations of MLP architecture and training regime were used.

4.5.2. Testing
The trained neural networks were tested in two ways: by plotting ROC curves describing their performance on the full data file (including the data not used in training), and by comparison with the ROC curve for a nearest neighbour classifier applied to the same data. For large data sets such as this the nearest neighbour classifier approaches a Bayes optimal classification, and so it provided an upper bound on the performance of the neural networks, which was used as a benchmark during network training. However, it required a prohibitively large amount of processor time to run and so did not form a viable solution to the problem of fall detection in real time.

The trained neural networks gave output in the form of a probability, i.e. ranging from 0 to 1, with higher values indicating that the input data were more likely to represent a fall. Rather than apply a simple threshold to this output, counting all values above the threshold as falls and all values below it as non-falls, a more sophisticated method was applied in order to increase detection reliability. The output from the network was monitored for local peaks in the probability, and then the height of the peak was compared to a threshold. This ensured that, during extended high-velocity events such as falls, the network issued only one detection, rather than issuing a series
of fall detections for every point during an event that generated a network output higher than the threshold.

The issue of which quantities to plot on the axes of the ROC curves was problematic in this case. Firstly, falls were marked in the input data only at the three points of highest velocity during the fall. The falls were, however, extended events covering more than three frames, and so it was reasonable to expect the network to issue detections shortly before or after the marked three-frame window. It was therefore unreasonable to count as correct only those detections which coincided exactly with the marked positions of falls. Secondly, the network might issue more than one detection during the course of a fall. This provided the choice of whether to count all of the detections occurring close to a marked fall as correct detections, or whether to reverse the problem and look at how many marked falls had one or more detections in their temporal vicinity. Finally, a similar issue applied to false detections. If false detections could be caused by an event such as the subject sitting down, which was not marked as a fall but nevertheless caused a local period of high velocities, then several false fall detections might be issued within this period. This raised the problem of whether to count these as multiple false detections or as a single false detection, which in turn raised the logical problem of how to specify non-events. It should however be noted that any choice of how to count correct and false detections would allow the network performances to be compared as long as the same procedure was applied to the outputs of all networks.

The approach chosen for plotting the ROC curves for the neural network outputs was as follows. The outputs from the neural networks were monitored, and the heights of local peaks were compared with some threshold. Peaks higher than the threshold represented fall detections, and the positions of these fall detections in the input data file were recorded. A second loop scanned through the data looking for the specified positions of falls. Any true fall that had one or more fall detections by the neural network within 20 frames in either temporal direction (0.66 seconds at 30fps recording rate) was counted as a correct detection, and any specified fall that did not have such a detection by the network within that time period was counted as a real fall not detected. Therefore the reconstructed signal axis of the ROC curves showed the number of genuine falls detected by the network as a proportion of the total number of genuine falls. This treatment ensured ease of comparison between the different networks, as
the maximum possible number of events represented in the recovered signal measurement was limited to the number of labelled falls in the data. It avoided the potential problems inherent in examining the raw number of detections by the network e.g. if a particular network produced multiple detections during a small percentage of the labelled falls, but did not detect the remaining labelled falls, looking at the raw number of detections would give that behaviour an unfair advantage, whereas looking at the number of genuine falls which had one or more detections did not. Finally, the whole procedure was repeated, varying the threshold to which the heights of local probability peaks were compared, to plot out the whole span of the ROC curve. The treatment applied to generating data for the error rate axis of the ROC curves was slightly different. The same arguments applied, in that there were underlying events in the data (e.g. sitting down) which might generate one or more false fall detections by the neural network. However, in this case there was a logical problem of how to specify a non-event. Therefore, in the absence of a viable solution to this problem, the error rate was calculated as the number of false detections divided by the total number of data points which did not fall within a forty-point window around one of the falls. Plotting the ROC curves for the neural networks provided a method for picking the best-performing network, but it was also desirable to compare them with the optimal classification performance. Therefore, the results were compared to the results obtained from a nearest neighbour classifier. The nearest neighbour classifier can be shown to approach the Bayes optimal classification i.e. the classification that would be obtained if the underlying probability distributions which generated the data were known and were used at each point in the input pattern space to make the classification decision. A nearest neighbour classifier operates by searching for the n nearest points to each data point using e.g. a Euclidean distance metric, and then taking an average of the labels assigned to these points i.e. whether the point has been specified as belonging to a fall or not. A threshold can then be applied to this score, and points with higher values represent fall detections by the nearest neighbour classifier. As with the neural network, this output threshold was the parameter that was varied to plot the whole range of the ROC curve. The treatment applied to convert these detections into percentages for ROC curve plotting was kept exactly the same as the
procedure used with the neural network ROC curves, including scanning for local peaks in the output, to ensure that the curves could be compared.

Fig. 4.11. ROC curves (a) for the nearest neighbour classifier using 30 neighbours, net 89, net 33 and for classifications based on single velocity measurements. The x-axis shows the error rate as a proportion of the total number of data points and the y-axis shows the proportion of true falls detected. A detail from the plot is shown in (b).
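A minimal sketch of the peak-scan and tolerant matching just described (the thresholds and window sizes follow the text; the implementation and names are ours):

```python
import numpy as np

def detect_peaks(net_output, threshold):
    # Indices of local maxima in the network output that exceed the threshold;
    # one detection per peak avoids repeated alarms during a single event.
    o = np.asarray(net_output)
    peaks = (o[1:-1] > o[:-2]) & (o[1:-1] >= o[2:]) & (o[1:-1] > threshold)
    return np.nonzero(peaks)[0] + 1

def count_hits(detections, fall_positions, window=20):
    # A labelled fall counts as detected if any detection lies within
    # `window` frames (0.66 s at 30 fps) on either side of it.
    return sum(any(abs(d - f) <= window for d in detections)
               for f in fall_positions)
```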
The number of nearest neighbour points used to make the classification decision was varied between 10 and 50, and the best-performing nearest neighbour classifier was selected using the ROC curves. In order to produce a measurement of the performance improvement gained through applying a neural network to this problem, software was written to make the classification decision based on single velocity data points (the lower bound). The central velocity point from each nine-point window was used in place of the neural network output, scaled so that all downwards velocities lay between 0 and 1, but the remainder of the decision code was kept exactly the same as for the neural network. The ROC curve for this classification system was produced and compared to those for the neural networks and nearest neighbour classifiers.

Fig. 4.11 shows the ROC curves produced by the methods outlined above for the best-performing nearest neighbour classifier (using n = 30); the best-performing neural network trained with CGM (net 89); the best-performing neural network trained with RPROP (net 33); and the single data point decision system. The best performing neural network was therefore net 89, which had 18 hidden nodes in one layer and was trained with 2000 iterations of CGM. No further improvements in performance were gained either through increasing the number of training iterations, or by using more data in the training phase. The ROC curve for this network lay reasonably close to that for the nearest neighbour classifier, with error rates around 2 times higher. However, the error rate for the single data point decision system was a further factor of five higher, showing that considerable performance gains were realised through the application of a neural network to this problem.

4.6. Conclusions

Overall, it is clear that the approach of calculating vertical velocities from the infrared images and classifying them as fall or non-fall using a neural network is sufficient to produce a practical fall detector. Close examination of the scenarios that led to false fall detections showed that most were due to movements specific to the simulations, to movements occurring immediately after a fall, or to changes in the viewing angle of the cameras during the simulations. All of these can justifiably be ignored. The remaining false detections occurred during high-speed sitting events that could more accurately be described as falls into chairs. It might be expected
from a comparison of the basic physics of falls into chairs and falls to the ground that these two classes of events would be indiscriminable in terms of the distributions of vertical velocities they generate, and the performance of the nearest neighbour classifier, an honest classifier, when applied to such events strongly supports this. In order to calculate the performance that would be seen on realistic data additional information, such as the number of times per day that the average subject sits down, would be required. The detection efficiency for true falls can be varied by changing the threshold applied to the output of the neural network. The percentage of true falls detected will also be the percentage of events in the classes indistinguishable from falls that are falsely detected as falls. For example, if the system is tuned to detect 50% of all true falls, then 50% of all high-speed sitting events, i.e. those sitting events involving a period of free-fall, will generate false fall alarms. This proportion, multiplied by the number of such events that occur each day with a typical subject, will give the average daily false alarm rate. It is probable that, given the target subject group for this system, such events would be rare and thus the false alarm rate would be low, but only a further study involving evaluation of the prototype system in a realistic environment could determine this.

The study of performance evaluation is vital in placing machine vision on a solid scientific basis. We have described a case study: the development of a vision system to detect natural events in a low-resolution image stream. The work has involved two distinct examples of the assessment of algorithmic design decisions to maximise detection reliability. In the first example this assessment was carried out by comparing measures and estimates made by the system under development with measures obtained independently, in the absence of genuine ground truth data. We have shown that even when these independent measures are themselves noisy, their independence can serve to guide rational design decisions and allow performance estimates to be made. In the second example we have shown that the temporal identification of events can be subjected to a similar performance analysis, and that upper and lower bounds placed on the data by independent classifiers can guide algorithmic design in a different way, providing an estimate of the proximity of the system to optimal performance. In both cases the analyses were performed using ROC curves, showing that, with suitable consideration of the definitions of true and false detection rates, such curves can provide a unified, generic approach to performance evaluation
in a wide range of machine vision problems. We therefore believe that, although presented here for one specific system design, such an approach will be applicable to other situations when an image-based system is to be used in the analysis of natural scenes in the absence of a precise ground truth.

Acknowledgments

The authors would like to acknowledge the support of the MEDLINK programme, grant no. P169, in funding part of this work. The support of the Information Society Technologies programme of the European Commission is also gratefully acknowledged under the PCCV project (Performance Characterisation of Computer Vision Techniques) IST-1999-14159. All software is freely available from the TINA website www.niac.man.ac.uk/Tina.

References
1. K.W. Bowyer and P.J. Phillips, Empirical Evaluation Techniques in Computer Vision, IEEE Computer Press, 1998.
2. H.I. Christensen and W. Foerstner, Machine Vision Applications: Special Issue on Performance Characteristics of Vision Algorithms, vol. 9(5/6), 1997, pp.215-218.
3. R. Klette, H.H. Stiehl, M.A. Viergever and K.L. Vincken, Performance Characterization in Computer Vision, Kluwer Series on Computational Imaging and Vision, 2000.
4. J. West, J.M. Fitzpatrick, et al., Comparison and Evaluation of Retrospective Intermodality Brain Image Registration Techniques, J. Comput. Assist. Tomography, 21, 1997, pp.554-566.
5. P.J. Phillips, H. Moon, S.A. Rizvi and P.J. Rauss, The FERET Evaluation Methodology for Face-Recognition Algorithms, IEEE Trans. PAMI, 2000.
6. E. Guelch, Results of Tests on Image Matching of ISPRS III/4, Intl. Archives of Photogrammetry and Remote Sensing, 27(III), 1988, pp.254-271.
7. I.T. Phillips and A.K. Chhabra, Empirical Performance Evaluation of Graphics Recognition Systems, IEEE Trans. PAMI, 21(9), 1999, pp.849-870.
8. P.A. Bromiley, N.A. Thacker and P. Courtney, Segmentation of Colour Images by Non-Parametric Density Estimation in Feature Space, Proc. BMVC 2001, BMVA, 2001.
9. E.J. Pauwels and G. Frederix, Finding Salient Regions in Images: Non-Parametric Clustering for Image Segmentation and Grouping, Computer Vision and Image Understanding, 75(1/2), 1999, pp.73-85.
10. J.D. Foley, A. van Dam, S.K. Feiner and J.F. Hughes, Computer Graphics, Principles and Practice, Addison-Wesley, Reading, 1990.
11. R.W.G. Hunt, Measuring Colour, Second Edition, Ellis Horwood, 1991.
CHAPTER 5

Task-Based Evaluation of Image Filtering within a Class of Geometry-Driven-Diffusion Algorithms
I. Bajla, I. Hollander Austrian Research Centers Seibersdorf, Seibersdorf, 2444 Austria E-mail:
[email protected]
V. Witkovsky Institute of Measurement Science, Slovak Academy of Sciences 842 19 Bratislava, Slovak Republic E-mail:
[email protected]
A novel task-based algorithm performance evaluation technique is proposed for the evaluation of geometry-driven diffusion (GDD) methods used for increasing the signal-to-noise ratio in MR tomograms. It is based on a probabilistic model of a stepwise constant image corrupted by uncorrelated Gaussian noise. The maximum likelihood estimates of the distribution parameters of the random variable derived from the intensity gradient are used for the characterization of staircase image artifacts in diffused images. The proposed evaluation technique incorporates a "gold standard" of the GDD algorithms, defined as a diffusion process governed by ideal values of conductance.
5.1. Introduction
The key importance of evaluation and validation of computer vision (CV) methods for their practical acceptance has led the CV community in recent years to a considerable increase of activities in this field 1,2,3,4. Our goal in this paper is to contribute to this trend by addressing the evaluation of performance of the geometry-driven diffusion filtering algorithms used in Magnetic Resonance (MR) imaging. We propose an evaluation methodology that can be related to the task-based type of evaluation of CV algorithms,
introduced and discussed at the 9th TFCV workshop 5. With the advent of fast MR acquisition methods in the late eighties, 3D visual representation of anatomical structures and their morphometric characterization became attractive for clinical practice. Unfortunately, while providing access to anatomical information through high-speed data acquisition, the fast MR imaging techniques suffer from a decrease of the signal-to-noise ratio (SNR). However, image segmentation, which is a prerequisite for 3D visualization 6 and morphometry 7 in medical imaging, requires a high SNR. Since the fast acquisition of a huge volume of data precludes the use of on-line noise suppression methods, off-line edge-preserving image filtering techniques are of great interest. In the last decade nonlinear geometry-driven diffusion (GDD) filtering methods have proved to be relevant for increasing SNR, especially in medical imaging applications 8,9,10. While the theoretical exploration of various GDD approaches has advanced considerably in the last decade 11,12,13,14, particular aspects of the discrete implementation of the algorithms and their performance have been addressed only in a few papers 8,15,16, though quantitative methods of algorithm performance evaluation are important for the practical acceptance of the algorithms 1. The problem of staircase artifacts which occur in diffused images represents one such aspect. In this paper we focus our attention on developing an evaluation technique that enables the characterization of such artifacts in the GDD-filtered MR-head phantom. Several authors have addressed problems of noise modeling in MR data 17,18,19. The conditions have been established under which the basic model of Rician distribution of the noise 20, derived for the magnitude image in MRI, may be replaced by a Gaussian distribution. Thus, in our evaluation study the stepwise constant MR-head phantom corrupted by uncorrelated Gaussian noise (Fig. 5.1), satisfying such conditions, will be used as ground truth. Further, we introduce the notion of a "gold standard" for the GDD algorithms that can serve as a sound basis for evaluation studies.

5.2. Nonlinear Geometry-Driven Diffusion Methods of Image Filtering

Nonlinear geometry-driven diffusion methods based on parabolic partial differential equations with scalar-valued diffusivity (conductance), given as a function of the differential structure of the evolving image itself, constitute a wide class of diffusive image filtering methods which are
Task-Based Evaluation of Image Filtering
91
Pig. 5.1. The artificial phantom of the MR head tomogram as an ideally segmented mode! and reference image for evaluation of GDD-llteriag performance. The intensity values in individual regions are as follows: l-ventricles(35), 2-white matter(130), 3-grey matter(SO), 4-CSF(30), 5-subcutaneous fat(240), 6-image background (0) .
frequently used in various applications, including those of medical imaging. The essential goal of these methods is edge-preserving noise suppression. The application of these methods to the phantom, comprising piecewise constant intensity regions corrupted by some noise, enables to evaluate the results achieved in terms of the independent homogenization of the individual region interiors and preservation of intensities on given boundaries. This is actually the sense of evaluation studies in the field of image filtering methods. The basic mathematical model of the GDD-filtering proposed by Perona and Malik 21 is described by the partial differential equation: 01/dt = div [c (\VI(x, y, i)\) W ( x , y, t)],
(5.1)
where \VI(x,y,t)\ is the gradient magnitude of an image intensity function I(x,y,t), the conductance c(-) is a spatial function of the gradient magnitude that may vary in time t. The substantial point of the application of any GDD method is the transition from continuous mathematical setting to discrete numerical schemes for solving differential equations. The usual way to discretize the differential equation (5.1) is to define a regular grid of points in each spatial dimension and in time (scale) dimension. In image processing applications the spacing in the spatial domain is a priori determined by the distance $ of adjacent image pixels (we call the corresponding spatial grid shortly the <J-raster of
Bajla, Hollander,
92
Witkovsky
the initial image). The time spacing At is to be set in the discrete diffusion scheme. The spatial derivatives are approximated by centered differences, whereas the forward differences are used for the time variable. Two basic discretization schemes are being used in this area: (i) explicit 21 ' 8 , and (ii) semi-implicit 15>16. For digital image evolution based on the diffusion, apart from numerical aspects, the discretized values of the conductance play an important role. The conductance is a specific function the extreme values 0, 1 of which should represent the membership of the raster point to a region interior or region boundary, given by the a priori partitioned phantom image. Therefore, considering the discrete diffusion formula in the case in which the extremal values of the conductance are used in each diffusion iteration may provide us with useful information on filtering limits of the discrete diffusion process in the given image. Without loss of generality we can discuss only the explicit formula of diffusion. Let us consider the expression (see, e.g., [8] or [21]) of such a scheme, which is given for an arbitrary pixel I(i,j): Il+l = I\3 + T[CN-DNP
+ cs-DsP + cw-DwI1 + cE-DEI%
(5.2)
where l\^ is the transformed intensity in the t + 1-th iteration and /*• is a result of the previous iteration t, r is a time step. The terms DxP, denote the intensity differences for four neighbors in the 4-neighborhood of the central pixel I(i,j), i.e., D^P = I(i - \,j) - I(i,j); Dsl1 = I(i + l , J ) - - f ( M ) ; etc. In the evaluation studies the formula is to be applied to diffusion of noisy image phantom with a priori known region interiors and boundaries. For the diffusion method being evaluated first the conductances in the pixels are calculated which take the real values from the interval (0,1). Here, the conductance function can be interpreted as an inverse value edge indicator (i.e, an edge is indicated by a number approaching 0, ideally equal to 0). If we treat the formula as a local operator, ex are weighting coefficients of the corresponding differences Dxl- In an ideal case, for which the values of the conductances are calculated precisely as binary atributes of the membership of the given pixel to region interior or boundary, we would expect extreme contributions from the neighboring pixels. Obviously this is unreachable in the discrete diffusion scheme, since even in the case of extremal values of conductances the values ex always represent interpolated conductance values in the (J/2-subraster of the i-raster . This situation is
Task-Based Evaluation of Image Filtering
93
c(8/2) 4-
1
1„ „, ^
"| / 2
™- * *
p*- — **** {fw *«*• •'M* •* w •****•
s
(1.1)
(0,0)
(1.0)
Fig. 5.2. Linear interpolation of the conductance values c(S/2) from ci, c-2 of the adjar cent pixels.
illustrated in Fig. 5.2. For any two adjacent pixels of the initial ^-raster in the 4-neighborhood of the pixel being diffused, let the conductance values be denoted as ci, C2. Then the interpolated value of the conductance c(S/2) takes the values from the plane depicted in Fig. 5.2 by shaded surface. As to the conductance values, the similar can be applied to implicit diffusion formulae, since all of them comprise interpolated values of the corresponding conductances (see, e.g., [16]). 5.8. Diffusion-Like Ideal F i l t e r i n g of a Noise C o r r u p t e d Piecewise C o n s t a n t I m a g e P h a n t o m For the piecewise constant phantom an internal pixel vm € ilm is-defined as the pixel for which all pixels from its 4-neighborhood belong to the same region ilm. The pixel hm € O m is called boundary pixel of the region I If^, if at least one of its four neighbors belongs to a different region iln. For any two neighboring regions O m , O n of the phantom the following four adjacency pairs axe possible: (vm,vm),(vm,hm),(hm,hm))(hm,hn). Two relations (hm,vm), (hn, hm), which are reflective to the previous ones do not constitute extra cases and therefore we will exclude them from our
94
Bajla, Hollander,
Witkovsky
consideration. On the other side it is obvious that the couples (vm,v„) and (vm, hn) do not exist in our model. As a model of limits of the diffusion filtering we suggest an iterative diffusion-like operator in combination with a piecewise constant noisecorrupted image phantom. The operator itself uses explicit diffusion formula, but similar approach can be used for implicit schemes. Formally, the operator is identical to the diffusion formula (5.2): J?/1 = 4 +
T[WN-DNP
+ ws-DsP
+ ww-DwP
+ WE-DEP}.
(5.3)
Here, however, the coefficients WN, WS, WW,WE denote the weights of intensity increments from adjacent pixels which are given in Table 1. In the Table we also summarize the possible values of the interpolated conductances (in (J/2-raster) for ideal conductances ci,C2. We introduce the following two Table 5.1. The values of the weights in FILTIDEAL filtering and conductances in DIFIDEAL diffusion. Adjacent pixel type \Vm, Vm )
{Vm, hm) {hm, hm) \tlmi
tin)
weights Wx 1 1 1 0
ci(<5),c2(<5) Ci = 1, C2 = 1 Cl = 1, C2 = 0 C\ = 0, C2 = 0 C\ = 0, C2 = 0
c(S/2) 1 1/2 0 0
concepts: • Ideal diffusion (DIFIDEAL) is an iterative GDD process over the input image with known region boundaries that is governed by (5.2) with the conductances ci,C2 given in Table 1. • Diffusion-like ideal filtering (FILTIDEAL) is an iterative filtering of the input image with known region boundaries that is governed by (5.3) with weights u>x given in Table 1. The difference between the ideal diffusion DIFIDEAL and the diffusion-like ideal filtering is straightforward. Namely, in the case of {vm,hm) the best value of conductance approximation reachable in cases (ci,C2) = (1,0) or (0,1) is c(5/2) = 1/2. In the case of (hm, hm) we get zero increment in the diffusion scheme (5.2), i.e. no smoothing is possible within the boundary pixels of one region. Here c(6/2) corresponds to conductance values ex for X= N, S, W, E. It can also be noted from Table 1 that smoothing in the FILTIDEAL case is allowed for all kinds of adjacent pixels within the given
Task-Based Evaluation
of Image
Filtering
95
phantom region Ctm. It is stopped only in the case of two adjacent boundary pixels from distinct regions fim, Qn. For each iteration of any of these iterative procedures of phantom filtering the output image represents an extremal case that cannot be achieved by any diffusion method. Therefore the results of DIFIDEAL and FILTIDEAL filtering cases can be considered as standard results for GDD image filtering based on explicit discrete formula. We are also interested in developing such an evaluation of individual GDD algorithms that could quantify the visual appearance of resulting phantom images. It seems reasonable to base such an evaluation on the stochastic model of an appropriate quantity derived from image intensities. We consider the following ideas to be relevant for the stochastic model design. Considering an image phantom with a priori known boundaries enables us to build a reasonably simplified stochastic model that features two important aspects: (i) the formula of the random variable and (ii) the choice of the specific pixel subsets as random variable samples.
5.4. Stochastic Model of the Piecewise Constant Image Phantom Corrupted by Gaussian Noise For any two adjacent regions in the piecewise constant image phantom a simple image model consisting of two homogeneous regions, fii and fi 2 having different constant pixel intensities can be considered. Let this image be corrupted by Gaussian uncorrelated noise so that the pixel intensities in fix region have the probability distribution X\ ~ N(/j,i,a2). Similarly, the pixel intensities in ft2 region have the probability distribution X% ~ N(fi2, o-!)- For simplicity, in the following we will assume that a2 = a2 = a2. So, the intensity value in a given pixel can be thought of as a realization of a random variable X\ va.VL\ and X2 in f22Let us take two arbitrary pixels within fii. The intensity values in these pixels are realizations of two independent and identically distributed random variables, say X[ ~ N(fii,a2) and X" ~ N{^i\,a2). The difference between intensity values is defined by Ai = X[ — X". Under given assumptions Ai ~ N(0,2a2). Similarly we can define A 2 = X2 - X2 the difference of intensity values of two pixels from fi2. Under the assumption that X2 ~ N{fX2,o-2) and X2' ~ ./V( M2 ,CT 2 ) we get A 2 ~ N(0,2a2). Now, we will consider the 4-neighborhood of the pixel with the indices
Bajla, Hollander,
96
Witkovsky
(i,j). We define the differences of intensity values A,(i,j) Aj(i,j)
= X(i + l,j)-X(i-l,j), = X(i,j + 1) - X(i,j - 1),
(5.4)
where the independent random variables X(i — l,j),X(i + l,j),X(i,j — l), and X(i,j + 1) represent intensity values in pixels belonging to the 4neighborhood of the pixel with indices (i,j). The distributions of Aj(z, j ) and Aj(i,j) depend on the distributions of X's. There are just two possibilities for the distribution of the particular X. If X is from fii then X ~ 7V(/xi,<72), if X is from 0 2 then X ~ N(/j,2,a2). The distribution of the central digital gradient magnitude (simply gradient) denned by G' = G'(i,j) = ^A2(i,j)
+ A2(i,j).
(5.5)
is of interest. 5.5. Estimates of Probability Distribution Parameters for Characterization of Filtering Results The squared gradient is the argument of the conductance function and moreover, it is suitable for stochastic treatment. Therefore, we will consider the random variable (G') 2 = | ( A 2 ( i , j)+A?(i,j)) instead of the variable G'. Herefrom we denote (G') 2 = G. The following stochastic model is strictly valid only for the initial image. Yet, in our experiments described below, this model (using maximum likelihood estimation) proved to be reasonably robust even after iterations, which violate the assumption on neighbor pixel independence as well as on the normal distribution. More importantly, it proved to perform well as a vehicle to the quantification of the filtering process. In our image model we admit only four different cases of the relation between the pixels in the 4-neighborhood and the neighboring regions fix, f22Therefore, the individual probability distribution should be derived for each such case. (1) Case 1: All pixels in the 4-neighborhood come from the same region fii or n2. Then A, ~ N(0,2a2) and Aj ~ N(0,2a2) are independent random variables, as are the squares A 2 ~ (2CT 2 )X|(A) and A 2 ~ (2a2)X2(X). Consequently, G ~ \(2a2)X22 = fa2)xl where X2 denotes a random variable with central chi-square distribution and two
Task-Based Evaluation of Image
Filtering
97
degrees of freedom. Notice that the theoretical expectation of G in this case is E(G) = a2 and variance Vor(G) = a4. (2) Case 2: If X(i+1, j) and X(i — 1, j) come from the same region, say fii, but X(i, j + 1) and X(i, j — 1) come from different regions, say fii and il2, then A, ~ N(0,2a2) and Aj ~ iV(/xi - /J 2 ,2cr 2 ), and A? ~ (2cr2)xf and A 2 ~ (2<72)xi2(A) are independent random variables. The first one has distribution proportional to the central chi-square distribution with one degree of freedom, the second one has distribution proportional to the noncentral chi-square distribution with one degree of freedom and with the noncentrality parameter A = (/xi — ii^)2 j(2a2). Finally, we get that G ~ {^P2)X22W *s a random variable with distribution proportional to the noncentral chi-square distribution with two degrees of freedom and the noncentrality parameter A = (/xi — 112)2 / {^a2). Notice that the distribution of G does not depend on particular ordering of regions fii and Q2 in the possible directions. Here, the expectation of G is E{G) = icr 2 (2 + A), and variance Var(G) = a4{\ + A). (3) Case 3:lfX(i + l,j) and X(i-l,j),&nd&lso X(i,j + 1) and X{i,j-1), come from different regions, say fli and Q2, then A* ~ N(m — ^2,2cr2) and Aj- ~ N{fr - ix2,2u2). Consequently, A 2 ~ (2CT2)X'I2(A) and A 2 ~ (2<72)xi2(A) are independent random variables with distribution proportional to the noncentral chi-square distribution with one degree of freedom and with the noncentrality parameter A = (/ii — y^) 2 /(2a 2 ). Then, G ~ (|CT 2 )X2 2 (2A) is a random variable with distribution proportional to the noncentral chi-square distribution with two degrees of freedom and the noncentrality parameter (^1 — \xi)2 ja2. The expectation of G is E[G) = a2(I + A), and variance Var(G) = a4(I + 2A). (4) Case 4: If X(i + l, j) and X(i — l,j) come from the same region, say fix, and X(i, j +1) and X(i, j — 1) come also from the same region, however different from fti, say 0 2 , then Aj ~ iV(0,2o-2) and Aj ~ JV(0,2<72). Consequently, the Case 4 reduces just to the Case 1.
In Fig. 5.3 all three discussed cases are illustrated. In all cases, A and a2 remain unknown parameters. If we assume that we have a random sample of squared gradients from one of the three particular cases, then the maximum likelihood (ML) estimation of the parameters is of interest. In general, if Gi,..., Gn is a random sample from cr2X22(A), and A = (/zi — HI)2 ja2, then
m
Bajla, Hollander,
%«*e^^'S?(E
\
>
S
Witkovsky
*F
: 1- :
'
^
i
c-i^hi
1;
CASE 2
id
1
A
! I
CASE 3
Fig. 5.3. The scheme of three possible types of neighboring regions for the 4-pixei neighborhood and the corresponding distributions of the quantity G .
the maximum likelihood equations for A and a2 are J2h
f V W * 2 ) A J (Gi/a2) = n ,
a2
(v + A)
if G > va3,
(5.8)
(5.7)
Task-Based Evaluation
of Image
Filtering
99
where
(5-8)
G=^I>> i=\
H*>) = ~j
rr>
(5-9
with ^ = 2. Here, the function Iv{z) is the modified Bessel function of the first kind of order v,
The ML equations (5.6) and (5.7) are modified (corrected) versions of (29.34a) and (29.34b) from [22], p. 452, in accordance with [23]. For more details on noncentral chi-square distribution 22 . In practical applications, to find the ML estimates A and a2, the original nonlinear system of ML equations (5.6) and (5.7) is to be solved, which have, in general setup d ~ c2X22(A) with separated variables, the form (2 + A)A) ( ^ ) ( 2 + A ) - n = 0 ,
(5.11)
a2 =
. . (5.12) (2 +A) For the numerical evaluation of the left-hand-side of the first equation, at any prior choice of A > 0, it is reasonable to use a piecewise polynomial approximation of the function h(z) for small z. For large z we propose to approximate h(z) by the function h,{z) = z"1 - \z~2
- V3
- \z~\
(5.13)
For more details see Appendix, Part 1. The application of the above results yields the following formulae for maximum likelihood equations of the parameters: (1) Case 1 = 4: For this case, G\,...,Gn represent a random sample from central chi-square distribution, i.e. A = 0, Gi ~ (|cr 2 )x2) f° r i = l,...,n. Then A = 0 and a2 = G. The ML estimator of the
100
Bajla, Hollander,
Witkovsky
expected value of G is E(G) = a2 and the estimator of the variance is Vw{G) = a4. (2) Case 2: Here, G\,... ,Gn represent a random sample from noncentral chi-square distribution, Gi ~ (| cr2 )X2 2 ( / ^); w r t h ^ = (/*i — M2)2/(2tf"2), for z = 1 , . . . , n. The ML estimates are given as a solution of the system (5.11), (5.12), in which r.h.s./2 is to be used. Further, we can introduce ( M l ^ M 2 ) 2 = 2Xa2,
(5.14)
as the ML estimator of the squared difference between mean values of fii and Cl2- The estimators of mean and variance are E(G) = |CT 2 (2 + A) and Vwr{G) =CT4(1+ A). (3) Case 3 : Finally, let G\,...,Gn represent a random sample from noncentral chi-square distribution, Gj ~ (|cr 2 )x 2 2 (2A), with A — (/xi — M2) 2 /(2
^)(t
2A)2A
=
2G . . (2 + 2A)
(2 + 2A) - n = 0 , (5.15)
(5.16)
Moreover, we have (Mi - M2)2 = 2A«J 2 ,
(5.17)
and the remaining estimators E{G) = <72(1 + A), and Var(G) —
Task-Based Evaluation
of Image
Filtering
101
5.6. Implementation Results Goals In this Section we will summarize the results of the implementation of the technique we have proposed for quantitative evaluation of the GDD algorithm performance. The implementation was aimed at the following goals: • to document the efficiency of measuring the level of region interior homogenization, being reached in the process of diffusion, by the quantity Var(G), • to confirm that the application of this measure to subsets of region boundaries (of case 2 and 3, see the previous Section) provides an adequate and sensitive characterization of staircase image artifacts from which the diffused images usually suffer, • to demonstrate that ideal diffusion processes (DIFIDEAL, FILTIDEAL) designed especially for GDD-filtering of the known phantom can actually serve as a gold standard (reference image and characterization) for comparison of the GDD algorithms, • to illustrate that the quantitative characteristic Var(G) applied selectively to region interiors and region boundaries are in a good correlation to visual appearance of the processed images.
Conditions In our calculations the stepwise constant MR-head phantom with 5 regions and 5 boundaries (Fig. 5.1) has been used. The phantom was corrupted by uncorrelated Gaussian noise with the parameter a = 9. This value was chosen to reflect usual conditions of fast MR image acquisition (e.g., via 3D FLASH method). Furthermore, using this value is in accordance with the requirement on Rician distribution of MR noisy data to become Gaussian that is given by the condition: SNR > 10 dB in a pixel 20 . Specifically for our case with the minimum intensity 30 (region 4) we obtain SNR = 101og(s2/
Bajla, Hollander,
102
Witkovsky
First, the optimum value had been calculated using the iteration stopping criterion proposed by us in [10]. To ensure a uniform comparison basis the optimum number of iterations obtained was then applied to the remaining GDD methods involved in our evaluation experiment.
Choice of the GDD algorithms
to be
compared
For the demonstration of the potential of the quantity Var(G) to characterize diffusion algorithms it seems reasonable to choose the conventional Perona and Malik's algorithm (denoted in the Tables as FIXK) with the conventional exponential conductance given as follows:
cond(\VI(x,y,t)\)
= exp
\VI(x,y,t)\l2^ K
(5.18)
where K denotes the relaxation parameter. In practical applications 8 ' 1 0 this parameter is chosen as the quantile q of the histogram of intensity gradient magnitudes. Next, in our previous paper 25 on a locally adaptive conductance (here the algorithm will be referred to as ADAPTKLKR) we compared the proposed algorithm to two different GDD algorithms, namely to Black's 24 (referred to as Black) and Weickert's 16 (referred to as Weickert) algorithms. We consider it appropriate to involve all the three algorithms into our evaluation experiment. Now, we will describe the core of these algorithms, for details see the cited papers. Black et al 24 developed an approach in which boundaries between the piecewise constant regions are considered to be "outliers" in the robust statistics interpretation. For the robust estimation procedure, that estimates a piecewise smooth image from a noisy input, they proposed to use a conductance based on Tukey's biweight robust estimator:
g(x,
fl/2[l I0
{x/affx
. otherwise,
(5.19)
where x = |VJ|. The global parameter a is calculated as a function of the "robust scale" ae : a = \/b ae . For a digital image it is computed using the gradient approximation over the entire image.
Task-Based Evaluation
of Image
Filtering
103
Weickert 16 proposed a rapidly decreasing conductance for GDD filters. Based on our computer experiments and personal communication with the author we proposed a modified formula for the conductance function for which good visual results are achieved for the parameter value m = 2: (l g(\VI„\2) = { 1
I -
|W„|=0 ex
(
r
^
p{dv/:iv>r}i v ^i>°-
( 5 - 2 °)
Here, by |V7o-| the gradient magnitude of a smoothed version of the image 7, obtained by convolving 7 with a Gaussian of standard deviation IT, is denoted. The constant Ci is calculated in such a way that the flux ^>{s) = s • g(s) is increasing for s € [0,7] and decreasing for s £ (7,00), where s = |V7CT|2. This gives C2 = 2.3367. The parameter 7 plays the role of a contrast adjustment: structures with |V7CT| > 7 are regarded as edges, while structures with |V7CT| < 7 are considered to belong to region interiors. Besides the basic mode (with 5 or 7 diffusion iterations) the Black and Weickert algorithms have also been tested for 50 iterations and for optimum number nopt of iterations given by the application of the stopping criterion (SC). For the phantom in the case of the Black algorithm n^pt = 24, whereas in the Weickert algorithm n^L = 18. Finally, the ADAPTKLKR algorithm (with the explicit scheme) is based on making the conductance locally adaptive. It is accomplished by (i) using a pixel dissimilarity measure diss\i,j,t], defined for the pixel neighborhood, and (ii) by incorporating this local measure into the calculation of the variable relaxation parameter Kioc[i, j , t] of the conductance cond(G[i, j,t],Kioc[i, j,i\), defined by the formula: cond(G[i,j,t))
= exp\-
J ^ . V ' . f . . ,, 2 | •
(5.21)
This parameter is given by
Kl0C[iJ,t] = Kr-
fd^J,t}-dissmin\1/P{Kr
_
Ki)
(5 22)
The symbols Ki, Kr denote the extremum values of the relaxation parameter.
Bajla, Hollander,
104
Results
and
Witkovsky
discussion
In Table 5.2 the values of Var{G) for the individual region interiors fij are listed which have been obtained after filtering the MR-head phantom by the selected GDD algorithms. The results are in descending order of Var(G), starting with those for the noisy input phantom regions and ending with the values corresponding to two gold standard algorithms. The method B comes after the method A, if at least three region boundaries (interiors) of B exhibited lower values of Var(G) than those for A. Based on the values listed in these tables the following findings have been established. The quantity Var(G) can serve as a very sensitive measure of region interior homogenization achieved by the GDD filtering. The quantitative characterization of this effect is in a very good accordance with the visual perception of differences in region interior smoothing of the phantom obtained by different GDD filtering algorithms. Similar conclusions can be drawn for the characterization of the level of the region boundary preservation in the process of GDD filtering. Table 5.2. The values of the characteristic Var(G) in five region interiors of the MR-head phantom calculated for the output images of individual GDD-filtering methods. GDD-f method
number of pixels original Black, 7 iters. Black, opt. (24) Black, 50 iters. FIXK, q = 80% Weickert, 7 iters. Weickert, opt. (18) Weickert, 50 iters. FIXK, q = 90% ADAPTKLKR, opt. (7) DIFIDEAL FILTIDEAL
fil ventricles 1804
R e g i o n i n t e r i o r (case 1) fi4 fi2 n3 white CSF grey matter matter 11807 4047 4551
6 2 1 1
6 2 1 1
773.0 625.0 413.2 304.8 83.1 14.5 6.5 5.5 3.6 2.9 2.0 0.7
555.1 582.3 462.6 384.6 83.9 15.4 8.0 6.9 3.8 5.4 1.8 0.7
6 2 1 1
721.8 988.4 906.1 839.7 272.7 102.2 79.8 76.5 25.9 83.4 9.0 1.3
6 2 1 1
441.1 618.6 490.1 421.1 141.9 39.1 26.9 25.1 8.8 11.3 3.8 0.7
n5 fat 4120 5 2 1 1
483.6 143.1 193.7 126.3 105.4 33.3 21.4 19.7 6.2 4.2 3.6 0.8
In Tables 5.3, 5.4 the values of Var(G) for the individual region boundaries Bi (cases 2 and 3 of the distribution of Var{G)) are listed which have
Task-Based Evaluation
of Image
Filtering
105
Table 5.3. The values of the characteristic Var(G).10 - 3 in five region boundaries of type 2 in the MR-head phantom calculated for the output images of the individual GDD-filtering methods. GDD-f method
number of pixels original Black, 7 iters. Black, opt. (24) Black, 50 iters. FIXK, q = 80% Weickert, 7 iters. Weickert, opt. (18) Weickert, 50 iters. FIXK, q = 90% ADAPTKLKR, opt. (7) DIFIDEAL FILTIDEAL
B1 ventricles 124 347.9 286.9 258.9 259.5 196.7 171.3 162.4 158.4 107.1 91.1 79.4 20.2
R e g i o n b o u n d a r y (case 2) Bi B2 B3 CSF white grey matter matter 926 1140 724 339.9 311.7 303.7 302.6 274.2 267.3 264.6 264.1 265.3 -277=9 236.0 212.8
119.3 94.6 85.5 85.0 61.5 51.9 49.7 49.1 46.2 46.4 29.5 6.6
31 31 31 31 31 31 31 31 31 30 31 31
552.6 573.6 599.8 600.7 560.8 768.2 785.2 776.0 115.8 422.6 308.6 057.5
B5 fat 704 3 3 3 3 2 2 2 2 2 2 2 2
739.8 328.2 156.2 060.8 822.6 728.9 642.7 591.5 498.0 461.9 440.7 270.7
been obtained in final output phantom images after GDD-filtering by the algorithms compared. Minor differences in the order of Var(G) values for different iteration numbers of the Weickert algorithm can be explained by the fact that 50 iterationss represent an extreme case. It is not reasonable to include such a quantitative characterization into the overall comparison. It can only serve here for an illustration. On the other hand slightly better result of the ADAPTKLKR, achieved for the boundary B 4 , than the result of the gold standards, has been caused by numerical instability of the solution of the system of nonlinear equations (5.15, 5.16). In Fig. 5.4, 5.5 the output phantom images obtained by the application of GDD-filtering algorithms referred to in Tables 5.2, 5.3, 5.4 are displayed together with their fragments. The images are arranged exactly in the same order as the characteristics Var(G) in Tables 5.2, 5.3, 5.4. Based on the detailed visual analysis (the fragments displayed represent only a part of it) we can conclude that Var(G) properly reflects the degree of homogenization within region interiors as well as the staircase artifacts. The particular numerical values of Var(G) describe the differences between the distinct resulting images in a good correlation with their visual appearance.
106
Bajla, Hollander, Witkovsky
noisy original
Black, 50 iters.
Biack, 7 iters.
Black, opt.
'eickert, Titers.
Pig. 5.4. Diffused images and their fragments ordered according to the increasing smoothing effect and decreasing occurence of staircase artifacts (part I).
Task-Baaed Evaluation of Image Filtering
Weickert, opt.
PTKLKR,opi
Weickert, 50 iters.
DIFIDEAL
10?
F1XK, q^90%
FILTIDEAL
Fig. 5.5, Diffused images and their fragments ordered according to the increasing smoothing effect and decreasing occurence of staircase artifacts (part II).
108
Bajla, Hollander,
Witkovsky
Table 5.4. The values of the characteristic Var(G).10 - 3 in five region boundaries of type 3 in the MR-head phantom calculated for the output images of the individual GDD-filtering methods. GDD-f method
number of pixels original Black, 7 iters. Black, opt. (24) Black, 50 iters. FIXK, q = 80% Weickert, 7 iters. Weickert, opt. (18) Weickert, 50 iters. FIXK, q = 90% ADAPTKLKR,opt.(7) DIFIDEAL FILTIDEAL
Bi ventricles 111 770.6 617.5 560.3 547.0 420.9 371.3 351.7 338.3 221.0 215.4 170.2 80.7
R e g i o n b o u n d a r y (case 3) B2 B4 B3 white grey CSF matter matter 555 754 463 1 1 1 1 1 1 1 1 1 1 1 1
521.2 473.1 452.9 452.9 401.1 386.6 369.6 368.8 374.4 656.0 259.9 196.2
206.0 166.5 150.6 148.2 116.2 95.5 88.4 85.2 77.6 80.5 51.6 25.1
108 108 107 107 107 107 107 106 105 101 106 106
798.9 091.0 798.2 597.5 114.8 129.3 003.3 915.2 127.6 316.4 477.4 207.2
B5 fat 444 12 11 11 10 10 10 10 9 9 9 9 9
475.2 626.4 252.2 999.5 594.1 401.3 089.2 977.8 933.6 803.9 732.4 416.1
Finally, we have compared the estimates of the quantity Var(G) calculated by Maximum Likelihood method to the estimates obtained by the standard moment method. The values obtained on boundaries which characterize staircase artifactcs manifested negligible differences. The acceptable differences have been observed for region interiors. So, the ML estimates proved to be robust enough for characterization results of individual diffusion iterations. 5.7. Conclusions We have presented in this paper a novel task-based algorithm performance evaluation technique related to the geometry-driven diffusion filtering algorithms used for increasing the signal-to-noise ratio in magnetic resonance brain tomograms. The main contribution of the paper is in the development of a probabilistic framework which is suitable for quantitative characterization of smoothing effect in diffused images. In particular, a random variable G has been derived from intensity gradient defined on selected image structures. The parameters of its chi-square distribution are estimated via maximum likelihood method adapted for the specific case of a stepwise constant image model corrupted by uncorrelated Gaussian noise.
Task-Based Evaluation of Image Filtering
109
The results presented in the computer simulation part of the paper show that the values of Var(G) characterize the level of region interior homogenization satisfactorily. Further, the values of Var(G) proved to be a sensitive measure for characterization of staircase artifacts in the diffused images which have been evaluated only qualitatively up to now. Finally, in this paper a notion of gold standards for the GDD algorithms has been introduced and explored. Both extremal cases of the GDD-filtering serve a sound basis for mutual quantitative comparison of the individual algorithms. Though the evaluation technique has been developed for the specific MR head phantom, it can be applied to arbitrary piecewise constant image model corrupted by uncorrelated Gaussian noise. Also, it is applicable to either explicit and semi-implicit discrete schemes.
Acknowledgements The research reported in this paper was supported, in part, by the grants of the Slovak Grant Agency for Science (VEGA) No.2/6017/99,1/7295/20, and by the Hertha-Firnberg-Fellowship awarded by the Austrian Ministry of Science and Transport.
Appendix Part 1 — Approximation
of the function
h(z)
Numerical calculation of the ML estimates requires repeated evaluation of the function h(z) = Ii(z){zIo(z)}~1 which depends on the modified Bessel functions of the first kind of order 0 and 1, respectively. There is a portable software package available for calculation of Bessel functions 26 . Here we discuss possible approximations of h{z). For large z Johnson et al 22 suggested to approximate h(z) by h\{z) = z~l. However, the detailed inspection has shown, that this approximation is inadequate for calculation of the ML estimates in a problem of our type. In Fig. 5.6 the function h(z) is plotted together with three its approximations for large z: h1(z) =
z-\
(5.23)
Bajla, Hollander, Witkovsky
110
h2(z) = z-1 h*(z) = z-1 - \z~2
l
- V
3
-z~\
(5.24)
- \z~A.
(5.25)
For z > 10 we get hi{z)-h(z)
<5.210"3,
h2(z)-h(z)
< 1.510" 4 ,
h4(z)-h(z)
< 2.6 10" 6 .
hi(z) -h(z)
< 5.7 10" 4 ,
h2{z) -h(z)
< 4.8 10" 6 ,
hi(z)-h{z)
< 8.710~ 9 .
For z > 30 we get
For small z it is possible to use a piecewise polynomial approximation or other useful approximation of h(z). In our calculation we have used the following approximation that ensures the error less than 10~ 5 for all z. h(z) = en + c12z + ci3z2 + ci 4 z 3 + a5z4 + c16z5 + cx7z% )iz<
1.5151 l
h(z) = c2iz~
+ c22 + c23z + c24z2 + c25z3 + c26z4 + c27z5 + c2Sz6
if 1.5151 > z < 3 Kz)
= c 3 l Z _ 1 + C32 + C33Z + CMZ2 + C 35 Z 3 + C36Z4 + C37Z5 + C3$Z6
if 3 > z < 10 h(z) = c 4 i z _ 1 + c 42 + ci3z + C44Z2 + c45z3 + c46z4 if 10 > z < 30 h(z) = h4(z) = z-1 - \z~2 - lz~3 if z > 30
\z~4
Task-Based Evaluation
of Image
Filtering
111
where the vectors of the required coefficients are given below: ci = [
5.000015750994815 E-001 - 7.015403181904873 E-005 - 6.181873338493644 E-002 - 2.674979284663376 E-003 1.558614337458592 E-002 - 5.141210530531576 E-003 5.075850841958089 E-004 ]';
c2 = [
6.250068921018347 E-004 7.661065716293550 E-002 7.600354333067073 E-002 2.171700783214764 E-003 -
4.756123409144552 E-001 1.669435610721310 E-001 1.766235739677730 E-002 1.127403006639790 E-004 ]
c3 = [ - 5.642713404322300 E-002 - 2.671307154042414 E-001 - 7.813927867677451 E-003 - 2.929469047034019 E-005
7.335985590582235 E-001 5.804624677802606 E-002 6.402783321777873 E-004 5.742038367545898 E-007 ;
c4 = [ 8.102916289852650 E-001 - 2.116473057740546 E-003 - 1.860820026817683 E-006
2.798000644183518 E-002 8.719134935268606 E-005 1.611056355777470 E-008
Part 2 — Approximate
formulae
for ML
estimators
As illustrated in Fig. 5.6, for large z, the function h(z) given by (5.9) can be sufficiently well approximated by h,2(z) = z~l — \z~2. Then the function h(z) may be utilized for derivation of explicit expression of parameter estimates. They can considerably simplify the computations in applications with sufficiently large values of the argument z. If Gj ~ u2^;'22(A), i = 1 , . . . ,n, then using h,2(z) instead of h(z) and from (5.11), we get the quadratic equation for A: 4(a-l)A2 + 4 ( 2 a - l ) A - l = 0
(5.26)
where a = G/G, G is given by (5.8), and
The maximum positive root of the equation (5.26) is the approximate ML estimate of A.
112
Bajla, Hollander,
1
:
1
!
1
1
Witkovsky
i
i
i
i
\hi:
h(z)
'-. -
Ss\j\
lh4
•
^ ^ \ b ^ '•
i
*
i
i
<
i
i
i
-
i
Fig. 5.6. Plot of the function h(z) = 7i(z){z/o(z)} *, and its approximations for large z. Here, Iv(z) is the modified Bessel function of the first kind of order v, h\(z) = z~l, h2(z) = z'1 - ± z - 2 , and h4(z) = z " 1 - ±z~2 - i 2 ~ 3 - | z " 4 .
In particular, solving the equation (5.26) and using (5.12) we get the explicit (approximate) formulae for ML estimators of A and a2 in general case: - ( 1 - 2a) + y/a(4a - 3) A= - ^ — " 7 7 v 7 ~ ~ — ^ 2(1 - a) «T2
=
(2 +A)
(5.28) (5.29)
Application of the above results leads to the following approximate formulae for maximum likelihood estimates of the parameters: (1) Case 1 EE 4: Here, Gu ..., Gn is such that G, ~ {\
= 2A<7 .
Task-Based Evaluation of Image Filtering
113
(3) C a s e 3: Here, d ~ (±<7 2 )x' 2 2 (2A), with A = (/n - /i 2 ) 2 /(2<7 2 ), * = l , . . . , n . Calculate G, G, and a = G/G. T h e n t h e approximate ML estimator of 2A is given by t h e right-hand-side of (5.28) a n d estimator of a as
+ A). Moreover, (/xi - ^2)
= 2AIT 2 .
We notice t h a t t h e above approximate formulae are more adequate for estimation of the p a r a m e t e r s t h a n t h e approximate formulae based on h\(z) = z"1 (an approximation of the function h(z) for large z) as suggested, e.g., by Johnson et a l 2 2 . All the obtained estimators are summarized in Table 5. Table 5.5. The approximate ML estimators of the parameters of the chi-square distributions related to three studied cases. Case
1
2
3
Distribution
¥2xl
^X'I(A)
|*V!(2A)
A
0
eg. (5.28)
r./i.s.(5.28)/2
<x2
G
2G/(2 + A)
G / ( l + A)
E{G)
G
i a 2 ( 2 + A)
a 2 ( l + A)
Va7(G)
G2
cr 4 (l + A)
CT4(1 + 2A)
References 1. Report on IEEE Computer Society Workshop on Empirical Evaluation of Computer Vision Algorithms, June 1998, Santa Barbara, CA. http:/www.itl.nist.gov/iaui/vip/2eecv/cvpr98-report.html 2. Phillips PJ, Bowyer KW (1999). Introduction to the special section on empirical evaluation of computer vision algorithms. IEEE Trans on Pattern Analysis and Machine Intelligence PAMI-21:289. 3. Liu X, Kanungo T, Haralick RM (1996). Statistical validation of computer vision software. In: Proc of the DARPA Image Understanding Workshop, February 1996, Palm Springs, CA. 11:1533-1540. 4. Lopez AM, Lumbreras F, Serrat J, Villanueva JJ (1999). Evaluation of methods for ridge and valley detection. IEEE Trans on Pattern Analysis and Machine Intelligence PAMI-21:327-335.
114
Bajla, Hollander, Witkovsky
5. Report on the 9-th TFCV Workshop on Evaluation and Validation of Computer Vision Algorithms, March 1998, Dagstuhl, Germany. http://www.tcs.auckland.ac.nz/~rklette/Announcements/ TFCV1998.html 6. Joliot M, Mazoyer BM (1993). Three-dimensional segmentation and interpolation of magnetic resonance brain images. IEEE Trans on Medical Imaging MI-12:269-277. 7. Zijdenbos AP, Dawant BM, Margolin RA, Palmer AC (1994). Morphometric analysis of white matter lesions in MR images: Method and Validation. IEEE Trans on Medical Imaging MI-13:716-724. 8. Gerig G, Kiibler O, Kikines R, Jolesz FA (1992). Nonlinear anisotropic filtering of MRI data. IEEE Trans on Medical Imaging MI-ll:221-232. 9. Weickert J (1996). Theoretical foundations of anisotropic diffusion in image processing. Computing Suppl. 11:221-236 10. Bajla I, Hollander I (1998). Nonlinear filtering of magnetic resonance tomograms by geometry-driven diffusion. Machine Vision and Applications 10:243-255. 11. Catte F, Lions P-L, Morel J-M, Coll T (1992). Image selective smoothing and edge detection by nonlinear diffusion. SIAM J Num Anal 29:182-193. 12. Alvarez L, Lions P-L, Morel J-M (1992). Image selective smoothing and edge detection by nonlinear diffusion. SIAM J Num Anal 29:845-866. 13. Weickert J (1998). Anisotropic diffusion in image processing. B. G. Teubner Stuttgart. 170 pp. 14. ter Haar Romeny BM (1994). Geometry-driven diffusion in computer vision. Kluwer Academic Publishers. 439 pp. 15. Niessen WJ, ter Haar Romeny BM (1994). Numerical analysis of geometrydriven diffusion equations. In: Bart M. ter Haar Romeny (Ed), Geometrydriven diffusion in computer vision Kluwer Academic Publishers. 439 pp. 16. Weickert J, ter Haar Romeny BM, Viergever MA (1998). Efficient and reliable schemes for nonlinear diffusion filtering. IEEE Trans on Image Processing IP7:398-410. 17. Yue Wang, Tianhu Lei, "A new stochastic model-based image segmentation technique for MR image of MR imaging and its application in image modeling," In: Proc. of the IEEE Int. Conf. on Image Processing and Neural Networks , November 1994, Austin, TE, vol. I, pp. 182-186. 18. G. A. Wright, "Signal acquisition and processing for magnetic resonance imaging," In: Proc. of the IEEE Int. Conf. on Image Processing and Neural Networks , November 1994, Austin, TE, vol. Ill, pp. 523-527. 19. J. Sijbers, A. J. den Dekker, J. Van Audekerke, M. Verhoye, and D. Van Dyck, "Estimation of the noise in magnitude MR images," Magnetic Resonance Imaging, vol. 16, pp. 87-90, 1998. 20. Nowak RD (1999). Wavelet-based Rician noise removal for magnetic resonance imaging. IEEE Trans on Image Processing IP-8: 1408-1418. 21. Perona P, Malik J (1990). Scale-space and edge detection using anisotropic
Task-Based Evaluation of Image Filtering
22. 23. 24. 25.
26.
115
diffusion. IEEE Trans on Pattern Analysis and Machine Intelligence PAMI12:629-639. Johnson NL , Kotz S , Balakrishnan N (1995). Continuous univariate distributions. Volume 2. Wiley, New York. Second Edition. Anderson DA (1981). The circular structural model. Journal of the Royal Statistical Society, Series B 43:131-143. Black MJ, Sapiro G, Marimont DH, Heeger D (1998). Robust anisotropic diffusion. IEEE Trans on Image Processing IP-7:421-432. Bajla I, Hollander I. Locally adaptive conductance in geometry-driven diffusion filtering of magnetic resonance tomograms. IEE Proceedings-Vision, Image and Signal Processing vol. 147, pp.271-282, 2000. Amos DE (1986). A portable package for bessel functions of a complex argument and nonnegative order. Trans Math Software.
This page is intentionally left blank
CHAPTER 6 A C o m p a r a t i v e A n a l y s i s of C r o s s - C o r r e l a t i o n M a t c h i n g Algorithms Using a Pyramidal Resolution Approach
Nuno Roma Instituto Superior Tecnico / INESC-ID Rua Alves Redol, 9 - 1000-029 Lisboa PORTUGAL E-mail: [email protected], URL: http://sips.inesc-id.pt/~nfvr
Jose
Santos-Victor
Instituto Superior Tecnico / ISR E-mail: [email protected], URL: http://www.isr.ist.utl.pt/~jasv
Jose Tome Instituto Superior Tecnico / INESC-ID E-mail: [email protected]
Disparity map estimation is often regarded as one of the most demanding operations in computer vision applications. Several algorithms have been proposed to solve this problem. With such a number of distinct approaches, the question of choosing the most suitable algorithm for a given application is often raised. Few and limited resources can be found in the literature covering this problem. In the following paragraphs it will be presented a comparative analysis of the performance and characteristics of a set of similarity measure algorithms proposed in the literature in the past few years. The obtained results can be regarded as an extremely valuable basis for selecting the most suitable registration algorithm for a given application. The study was focused on the analysis of two distinct aspects: the final matching error and the computational load of each of the considered correlation functions. Besides this comparative study, the advantages of using a pyramidal resolution approach were also considered. This scheme has proved to be effective in reducing the overall computation time and the required number of arithmetic operations, having an insignificant loss in the final matching error.
117
118
Roma, Santos-Victor
and Tome
6.1. Introduction During the past few years, the estimation of the disparity field of an image sequence has been playing an increasingly important role in a large number of computer vision applications, such as video coding, multi-view image generation, camera calibration, 3D reconstruction from stereo image pairs, object recognition, visual control and motion estimation 1, . The main purpose of this research is to present a quantitative and comparative analysis of several cross-correlation similarity measures used in disparity estimation, applying a pyramidal resolution approach. Matching error and computational load measures will be used to compare the several registration algorithms, as well as the different possible configurations of the hierarchical processing scheme. By definition, disparity is the difference between the coordinates of two matched points. The disparity map is composed by a dense field of disparity vectors, one for each matched pixel or group of pixels. However, depending on the considered application, this definition can have some slightly different interpretations. While in video coding this map is a set of motion vectors computed using two distinct images which correspond to two different time instants, in 3D reconstruction it is computed using two images corresponding to two different points of view of the same scene. In this last case, the obtained vectors provide 3D perception by inferring depth information from the scene, thus enabling 3D reconstruction of objects 3 . To make this possible, a high-resolution estimation of disparity vectors of all image pairs is often required, leading to the computation of high accuracy disparity maps. Several different approaches have been proposed to solve this correspondence search problem 1 . Most of them can be classified in three distinct groups: • Feature based algorithms, where some predefined features are extracted from both images, such as edge segments or contours, to compute the correspondence matching field; • Area based algorithms, where registration algorithms using crosscorrelation based similarity measures 4 are used to find the block of image pixels which best matches the one being analyzed in the other image; • Optical-flow algorithms, relying on the relation between photometric correspondence vectors and spatiotemporal derivatives of luminance in an image sequence 5 .
A Comparative Analysis
of Cross-Correlation
Matching Algorithms
119
Although feature based algorithms and optical-flow algorithms have been object of an intense research during the last few years, conventional area based algorithms still remain very popular and will assume an important position in the next future in correspondence computation tasks. The main reason for this fact relies on their simple and straightforward implementation, well suited for parallel implementations based on VLSI circuits or Digital Signal Processors (DSP), as well as their robustness against certain image transformations. Furthermore, they directly provide the dense disparity map, whereas in feature based approaches an interpolation step is required if a dense map of the scene is desired. However, they also have some drawbacks that should not be ignored. The most serious one is concerned with the significant computational load, associated to the computation of the dense matching field. To circumvent this disadvantage, a hierarchical processing scheme is now proposed. Furthermore, although several similarity measure functions have been proposed in the past, few and limited resources can be found in the literature comparing and characterizing them. Since the final performance can be greatly influenced by a correct selection of the registration algorithm, a complete and exhaustive comparison of these functions urges to be performed. With such a study, such as the one presented in this research work, it will be easier to select the most suitable similarity function to be used in a given application, depending on the desired matching error level or on the available computational resources. This research work is organized as follows. In section 6.2 area based matching algorithms will be described, as well as their major advantages and disadvantages, section 6.3 will present the several registration algorithms covered by this research. The pyramidal processing scheme will be described in section 6.4. In section 6.5 it will be presented the used evaluation methods and the obtained experimental results. Section 6.6 concludes the presentation. 6.2. Area Based Matching Algorithms In area based matching algorithms, rectangular blocks of pixels from a pair of M x N images (left and right images) are compared and matched (see figure 6.1). For each block of the right image (reference window), a corresponding block in the left image is sought, using a given similarity measure as main criteria (see figure 6.2). During the search process, the left image block is displaced by integer increments (c, I) around a predefined
Roma, Santos-Victor and Tome
120
region (search window), and an array of similarity scores d (c, I) is computed (see eq. 6.1). The position (CMJM) of the moving block corresponding to the maximum computed value of the considered similarity function 0 for that search window is selected and chosen to obtain the optimum disparity vector corresponding to that reference window. Hence, a matching dense field D (x, y) is computed by using as many overlapping search windows as the number of pixels of the image, thus obtaining a disparity vector for each pixel (see eq. 8.3). ft-width
^length
<«(c,0 = 5Z Yl R(u,v)QS(c + u,l + v) «=o «=o d (cM, LM) - argmax{ d(c,l) (c,0 D (x,y)
= { dxy {CM,IM)
(6.1)
: 0 < c < Swidth ; 0 < I < Siength } (6.2) • 0 < x < M;
0 < y < N }
(6.3)
Several similarity functions have been proposed in t h e literature 4 . T h e selection of a particular function is usually based on its computational load, algorithmic simplicity and achieved performance. Similarity measures based on cross-correlation, normalized cross-correlation, s u m of squared differences a n d s u m of absolute differences a r e often chosen. However, other measures based on co-occurrence matrices have recently been proposed 8 . '•' Moreover, t h e selection of t h e reference a n d search windows width is not a simple a n d trivial task 7 , 1 . I n fact, while t h e probability of a mism a t c h usually decreases when t h e reference window is enlarged, using large windows often leads t o a n accuracy loss, since t h e influence of image differences is greatly aggravated with t h e increase of t h e considered area, a s will b e shown in section 6.5. O n t h e other hand, t h e size of t h e search window il '•.••:••<•• • !'.' • •:< : l i'i ••••<: .ji.i ..-•! '..si-.ii t-\ 1 in s> • UJ i :•-1. •!:•••• i •''. . • . • >r.
Fig. 6.1. Disparity map estimation process.
A Comparative Analysis of Cross-Correlation Matching Algorithms
121
Fig. 6.2. Searching procedure.
The greater the search window, the greater the allowed mobility of a given pixel. Therefore, this window should be wide enough so that it comprises the correspondent block of a given reference window. Furthermore, a difficult and important trade-off must be done when selecting these window widths, since the computational load and processing time usually increase linearly with their areas. Therefore, a compromise must often be done, by adjusting these parameters according to the image size and its contents. Consequently, disparity estimation of a dense matching field is usually considered to be one of the most challenging tasks of 3D reconstruction. This fact can be explained by several reasons, such as: • The dimension of the solution space is extremely large, since each image pixel can be matched with any pixel of the other image. • It is involves computational demanding algorithms. » Its accuracy is extremely dependent on the photometric characteristics of the images being analyzed such as texture, luminance and contrast, and the noise conditions .associated to the acquisition camera system. This task can be particularly influenced in images with lack of information, e.g., regions with constant bright, horizontal edges and repetitive patterns. e The existence of partially or totally occluded objects in the image pair can give rise to disparity errors with difficult solution. The estimation of dense disparity maps is usually performed by taking in consideration a set of constrains relating the two images being analyzed. One of these constrains is the so-called Constant Image Brightness (CIB) assumption, which states that a matched pixel pair should have an equal luminance value 1 . Therefore, the main problem consists in finding the pixel
Roma, Santos-Victor and Tome
122
whose neighborhood best matches the region being analyzed (see figure 6.3). Theoretically, all points of the other image can be accepted as possible candidates to this matching process. However, there is at most only one pixel which corresponds geometrically. Consequently, the CIB constraint alone is regarded as being insufficient for the estimation of dense disparity maps, being often considered an ill-posed problem. This task can be simplified by using some more constrains, making it possible to reduce the dimension and ambiguity of this problem, thus decreasing the total processing time. Some of the constrains more frequently used are 2 : « Epipohr Constraint: provides a conversion from a 2D search into a ID search, by imposing that the matched points must lie on the corresponding epipolar line of the two images; • Uniciiy: imposes that each pixel can have, at most, one correspondent pixel in the other image; ® Smoothness: imposes a continuous and smooth variation of the disparity « Order Constraint: forces the order of the points belonging to an epipolar line to remain the same; » Disparity Gradient: limits the allowed variation of the disparity values. 6.3. C r o s s - C o r r e l a t i o n A l g o r i t h m s The set of similarity functions considered in the comparative analysis are presented in table 6.1, as well as their definition expressions 4 . Due to their potentially robust real-time operation and their moderate requirements in terms of hardware and software resources, only correlation type algorithms were considered in this study. In this table, R(u,v) denotes a reference window pixel, while S(c,l) denotes a search window pixel, R the local mean of the reference window (R = Y^=Ssth Z ^ o " * ^ (w> u))> an< ^ & (c> 0 the pixel mean in the block of the search window being compared (S (c, I) = h dth
E^r EuZh s(c + u,i + v)).
LEFT
Pig. 6.3. Any point of the left image is a possible candidate to the matching process.
A Comparative Analysis of Cross-Correlation Matching Algorithms
123
Table 6.1: Cross-Correlation Algorithms. Correlation Name
Definition
R
Simple
R
length
Cross-Correlation SCC(c,l)
width
Y^
J2
w=0
u=0
R(u,v)
-S(c + u,l + v)
'length R„idth
Normalized Cross-Correlation NCC(c,l)
—. length
\
Y u=0
E
width
E
MOR(c,l)
t
h
2
«i«n 9 th R „ i d t h
E (s(<=+".i+">-s(c-1>)
v=o
E
width
( R ( " ' " > - R ) • ( s ( c + u , ! + «i)-S(o,())
E tt=0 _
"length R„i(ith
2
("("•»>-«) +
u=Q R length Rwidth
lRUngth
Y^
V]
v=0
u=0
E
E
v=0
u=0 2
R
width
2
R
length
width
E («<"•")-«) • E R
Sum of Squared Differences SSD(c,l) Sum of Absolute Differences SAD(c, 1)
ien a th Rwidth V1 V (R(u,v) „f0 u fo "length RWidth ^ E \R(u,v) v=0 length
^2
v=0
NSSD(C,1)
- S (c +u,l + v))2
- S (c + u,l + v)\
R
uiidth
E (R(".,')-s(e+".i+t'))2
u=0
/length " „ i d t h " l e n g t h
w E
E
V
u=0
u=0
2
E (s(<=+».'+«-)-s(c,o)
u=0
R
Normalized Sum of Squared Differences
(s(<=+«.' + »)-S(<=.0)
IB(«,«)-Ji)-IS(c+«,l+»)-s(c 1 l)J
R
i/ E
u=o
R
E u=0 Rlength Rwidth
E Normalized Zero Mean Sum of Squared Differences NZSSD(c,l)
_
u=o 2
v=0
(*<«•»>-*)' (s(=+«.«+«)-sW))
u=0
"length
Moravec
E sa<«+«.<+«>
<"'">- E
E («<"•»>-*) ' E
v =o
Rwidth
R
length
W E
Ur>gth
R2
E
R
R(u,«OS(c+U,< + v) R
Rwidth
Zero Mean Normalized E Cross-Correlation — »=" ZNCC(c,l) /"length R ^ y
^ v=0
H2 U ,
< ' ">-
"„idth
E
E
i; = 0
u=0
s2 c u i i
< + . + ')
Roma, Santos-Victor
124
and Tome
Table 6.1: (continued) Correlation Name
Definition
R
Zero Mean Sum of Squared Differences ZSSD(c, 1) Zero Mean Sum of Absolute Differences ZSAD(c, 1)
Ungth R^idth ^ ] T UR (U, V) - fi) - ( s (c + u, I + v) - S (c, J))] '
Locally Scaled Sum of Squared Differences LSSD(c.l)
v=0
u=0
"•length
R-width
E
E R
Locally Scaled Sum of Absolute Differences LSAD(c, 1)
\(R(U,V)-~R)
length
Ruiidth
J2
£
i»=0
u —0
/
- (s(c
+ u,l + v) -
_
S(c,l)) \
{R(u,v)-=M=-S(c
2
+ u1l + v)\
^ l e n g t h -^uiidth.
E
fi ( u , i > ) •
E
• S (c + u, I + v)
S(c,l)
The functions SCC, NCC, ZNCC and MOR are pure similarity measures, since the best match corresponds to the maximum value obtained with these functions. In contrast, the functions NZSSD, SSD, SAD, NSSD, ZSSD, ZSAD, LSSD and LSAD represent difference or dissimilarity functions and the best match is obtained when the value returned by these functions is the minimum. Some of these functions are normalized versions with respect to the mean and standard deviation of the SCC, SSD and SAD functions. The objective is to make these registration algorithms insensitive to changes of the brightness and contrast of R (u, v) and S(c,l) values 9 . Furthermore, in order to overcome possible distortions of these measures in the vicinity of the image bounds, a block normalization is often performed, by dividing each correlation result by the area of the correspondent reference window: d
"( c >0=p
7R
*Mvidth *
rf
M)
(6-4)
^length
Although the described functions present evident analogies, the correspondent computational load and hardware requirements can be significantly different. While with the SCC, the simplest function, it is only necessary to perform Riength x Rwidth multiply-and-accumulate (MAC) operations, arithmetic units capable of performing squared-root operations {ZNCC, NZSSD, NSSD), absolute-value operations (SAD, ZSAD,
A Comparative
Analysis of Cross-Correlation
Matching Algorithms
125
LSAD) or integer divisions (NCC, ZNCC, MOR, NZSSD, NSSD, LSSD, LSAD) are requited in other functions. These requirements are often an important aspect when selecting the most suitable similarity measure function for a given implementation, as will be further illustrated in section 6.5. 6.4. Pyramidal Processing Scheme In order to obtain a high-accuracy disparity computation, it is important to use reference and search windows large enough to provide the computation with the correct match even when pixel pairs present significant disparity values. However, the computational effort of this correspondence search increases significantly when the area of these windows increase. Furthermore, larger windows are usually associated to longer computation times. One form of solving these implementation issues is to use a hierarchical approach, by using a pyramidal processing scheme like the one depicted DISPARITY M A P
in figure 6.4.^{1,6} With this technique, the matching process is done in a multi-layered fashion and is based on a coarse-to-fine approach, providing significant functional and computational advantages.^{10} The left and right images are successively down-sampled by a factor of 2, using a decimation function to obtain lower-resolution versions of the original image pair. The original images represent level 0, and the image resolution decreases with the pyramid level. Therefore, the pyramid may be viewed as a 3D data structure, where the intensity of pixels is a function f(l, x, y) with 3 arguments: a level designator (l) and a pair of coordinates (x, y).

The matching estimation process is started at level L. This ensures that the earlier correlations are performed with the gross image features rather than with the details. The matching of these gross features is then used to guide the later high-resolution searches, thus achieving more accurate matches of these features and of the surrounding details. After this set of low-resolution pictures has been processed, the obtained disparity map is interpolated to the resolution of level L-1. These disparity values are then used as an initial estimate for the computation of the disparity map of this level (see figure 6.4). This process continues until the disparity map corresponding to full resolution (at level 0) is estimated. Therefore, to estimate the disparity field using several resolution layers, it is only necessary to apply the same algorithm repeatedly to each of the considered levels.

Moreover, by using this scheme it is possible to use the same small search and reference windows throughout the layered processing scheme. Consequently, each time the image resolution is increased, the coverage of these windows is reduced by a factor of 4, thus providing a gradual refinement of the matching process and a finer treatment of the details. This makes it possible to obtain accurate disparity values and significant coverage areas, which could otherwise only be obtained with larger and more time-consuming windows in a single-layered processing architecture. Some authors have proposed an additional strategy to speed up the matching estimation, by working with sub-images rather than processing the entire image.^{11} However, although with this solution the required memory space is lower, it involves an additional overhead in the whole processing scheme. The next subsections describe, in more detail, several important aspects of the pyramidal processing scheme.
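The coarse-to-fine control flow just described can be sketched as follows. This is a structural outline under stated assumptions, with decimate, interpolate_disparity and match_level left as hypothetical declarations standing in for the blocks discussed in this section.

```cpp
#include <vector>

struct Image        { /* pixel storage, e.g. row-major floats */ };
struct DisparityMap { /* per-pixel (dx, dy) plus a similarity score */ };

// Hypothetical building blocks (declarations only, per the text):
Image decimate(const Image& in);                          // low-pass filter + 2:1 sub-sampling
DisparityMap interpolate_disparity(const DisparityMap&);  // upscale map to next resolution
DisparityMap match_level(const Image& left, const Image& right,
                         const DisparityMap* initial);    // window-based matching at one level

DisparityMap pyramid_match(const Image& left, const Image& right, int levels) {
    // Build the pyramids: level 0 is the original pair, level `levels` the coarsest.
    std::vector<Image> pyrL{left}, pyrR{right};
    for (int i = 1; i <= levels; ++i) {
        pyrL.push_back(decimate(pyrL.back()));
        pyrR.push_back(decimate(pyrR.back()));
    }
    // Start at the coarsest level with no prior estimate, then refine downward.
    DisparityMap d = match_level(pyrL[levels], pyrR[levels], nullptr);
    for (int i = levels - 1; i >= 0; --i) {
        DisparityMap init = interpolate_disparity(d); // prior from level i+1
        d = match_level(pyrL[i], pyrR[i], &init);     // refine at level i
    }
    return d;                                         // full-resolution map (level 0)
}
```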
6.4.1. Number of Layers
One of the most critical decisions that usually arises when using a pyramidal processing scheme concerns the number of layers used in the structure. By increasing the number of layers it is possible to use smaller reference and search windows, thus leading to faster estimations of the dense disparity field. On the other hand, important features and other types of image information required for the matching process can be lost or distorted when too-coarse resolutions are used, giving rise to critical problems in the search process.^{12} Moreover, multi-scale image representations should be consistent, since features at different resolutions may be correlated.^{6,12} Therefore, significant features at different layers should not randomly appear or disappear when the resolution is increased. With careful design of the decimation and interpolation blocks of the hierarchical scheme, satisfactory results can be obtained. However, the image size and its contents should always be considered when deciding the number of layers used by the hierarchical structure.

6.4.2. Decimation Function
A pyramidal structure is usually implemented by sub-sampling the original image. In order to fulfill the Nyquist theorem, a low-pass filtering of the original image must first be performed. The filter implemented in the developed system is a Gaussian filter centered at $m = (m_x, m_y)$ and with variance $\sigma^2$, having the impulse and frequency responses given by eq. 6.5, and a $-3\,$dB bandwidth given by eq. 6.6.
$$h(x)=\frac{1}{\sqrt{2\pi}\,\sigma}\,e^{-\frac{(x-m)^{2}}{2\sigma^{2}}}\;\longrightarrow\;H(f)=e^{-2\pi^{2}\sigma^{2}f^{2}} \qquad (6.5)$$

$$BW_{-3\,dB}=\sqrt{\frac{\ln(2)}{4\pi^{2}\sigma^{2}}}=\frac{\sqrt{\ln(2)}}{2\pi\sigma} \qquad (6.6)$$
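For the record, eq. 6.6 follows from eq. 6.5 by solving for the frequency at which the amplitude response drops by 3 dB; the short derivation below is our addition, not part of the original text.

```latex
% -3 dB point of the amplitude response H(f) = e^{-2\pi^2\sigma^2 f^2}:
% |H(f)| = 1/\sqrt{2}  \Rightarrow  2\pi^2\sigma^2 f^2 = \tfrac{1}{2}\ln 2
\begin{align*}
e^{-2\pi^{2}\sigma^{2}f^{2}} &= \frac{1}{\sqrt{2}} \\
f_{-3\,\mathrm{dB}} &= \sqrt{\frac{\ln 2}{4\pi^{2}\sigma^{2}}}
                     = \frac{\sqrt{\ln 2}}{2\pi\sigma}
\end{align*}
```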
Therefore, the set of $(M_L \times N_L)$ pixels that compose the image layer of level L, $f_L(i,j)$, can be obtained by performing a 2D convolution of the image matrix of level L-1, with $(M_{L-1} \times N_{L-1})$ pixels, $f_{L-1}(i,j)$, with the impulse response of the Gaussian filter $h(x,y)$, followed by the application of the sub-sampling process. The efficiency of this algorithm can be greatly improved by noting that $(M_{L-1} \times N_{L-1})$ filter results are computed in this stage and only
$(M_L \times N_L) = (M_{L-1} \times N_{L-1})/4$ pixels are used after the sub-sampling process. If this 25% efficiency is considered in an L-layered pyramid, it is possible to conclude that only $(0.25)^{L-1}$ of the computations are actually used. To circumvent this limitation a different approach has been used: only the pixels which are used in the next sampling process to obtain the new desired layer are actually computed. This procedure can be regarded as a combination of the filtering and sub-sampling phases, thus avoiding the computation of unnecessary results.

To avoid a gradual decrease of the precision of the filter results as a consequence of multiple and cascaded quantization steps, the original image has always been used in all filter operations and at all implemented layers. This implies a gradual decrease of the filter bandwidth when the pyramidal algorithm descends to lower layers. As shown by eq. 6.6, this can easily be done by increasing the filter variance $\sigma^2$ with the level number. In the developed system, a variance given by $\sigma_L^2 = 2 \times \sigma_{L-1}^2$ was considered in the computation of the filtered layers, using a window of size $6 \times \sigma_L$. This window dimension guarantees that more than 99% of the area of the 2D Gaussian filter of eq. 6.5 lies within the window. Figure 6.5 presents the result obtained by applying the described decimation function to the 256 × 256 lenna test image.

Fig. 6.5. Application of a 4-layered decimation function to the lenna test image.
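A minimal sketch (our illustration) of the per-level filter construction implied by the text: the variance doubles from one level to the next and the kernel support is 6σ. The starting variance sigma2_1 is a free parameter here, and the 2D filter is assumed separable.

```cpp
#include <cmath>
#include <vector>

// Build the 1D Gaussian kernel used to filter the ORIGINAL image when
// producing level `level` (level >= 1). sigma2_1 is the variance chosen
// for level 1; per the text, the variance doubles with each level.
std::vector<double> gaussian_kernel(int level, double sigma2_1) {
    double sigma2 = sigma2_1 * std::pow(2.0, level - 1); // sigma_L^2 = 2 * sigma_{L-1}^2
    double sigma  = std::sqrt(sigma2);
    int half = (int)std::ceil(3.0 * sigma);              // 6*sigma total support (>99% of area)
    std::vector<double> k(2 * half + 1);
    double sum = 0.0;
    for (int x = -half; x <= half; ++x) {
        k[x + half] = std::exp(-x * x / (2.0 * sigma2));
        sum += k[x + half];
    }
    for (double& v : k) v /= sum;                        // normalize to unit gain
    return k;
}
// Only the pixels that survive the 2^level sub-sampling need to be filtered,
// which merges the filtering and sub-sampling stages as described above.
```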
6.4.3. Matching Process
The matching process was performed using the set of correlation-based similarity functions described in section 6.3 and presented in table 6.1. To achieve an efficient implementation of each of these algorithms, some manipulations of the expressions presented in table 6.1 have been performed, in order to obtain the final result in the minimum number of steps, using the minimum number of operations.^{11} This is illustrated in eqs. 6.7 through 6.10
for the general case of a zero-mean normalized cross-correlation, using the following simplified nomenclature from section 6.3: $R = R(u,v)$, $S = S(c+u,\,l+v)$, $\bar{S} = \bar{S}(c,l)$, $L = R_{length}$, $W = R_{width}$, $\sum\sum \equiv \sum_{v=0}^{L}\sum_{u=0}^{W}$, with $f$ and $g$ denoting the reference and search windows:

$$d(c,l)=\frac{cov_{c,l}(f,g)}{\sqrt{var_{c,l}(f)\times var_{c,l}(g)}} \qquad (6.7)$$

$$var_{c,l}(f)=\sum\sum\left(R-\bar{R}\right)^{2}=\sum\sum\left(R^{2}-2R\bar{R}+\bar{R}^{2}\right)=\sum\sum R^{2}-2\bar{R}\sum\sum R+(L.W)\,\bar{R}^{2}=\sum\sum R^{2}-\frac{\left(\sum\sum R\right)^{2}}{L.W} \qquad (6.8)$$

$$cov_{c,l}(f,g)=\sum\sum\left(R-\bar{R}\right)\left(S-\bar{S}\right)=\sum\sum RS-\frac{\sum\sum R\,\sum\sum S}{L.W} \qquad (6.9)$$

$$var_{c,l}(g)=\sum\sum\left(S-\bar{S}\right)^{2}=\sum\sum S^{2}-\frac{\left(\sum\sum S\right)^{2}}{L.W} \qquad (6.10)$$
Consequently, for the majority of these algorithms, it is only necessary to compute the values of $\sum\sum R$, $\sum\sum S$, $\sum\sum RS$, $\sum\sum R^{2}$ and $\sum\sum S^{2}$, all in one single step. Besides these manipulations, the block normalization of eq. 6.4 was also performed.

6.4.4. Interpolation
As described in section 6.4, the several disparity maps estimated at the lower levels of the pyramidal structure are used as an initial estimate of the disparity fields of the subsequent higher levels, following a classic coarse-to-fine approach. However, before these initial estimates can be used, a scaling-up operation must be performed on the disparity map obtained
from the previous layer, to conform its dimension and its vectors to the new layer resolution. This function was implemented using a bilinear interpolation algorithm based on the computation of the mean disparity value of the group composed of 4 or 2 neighboring disparity vectors, corresponding to the set of pixels belonging to a 3 × 3 interpolation window.
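A sketch of such a disparity-field upscaling step (our illustration, not the authors' code): coarse vectors are doubled in magnitude and bilinearly interpolated onto the finer grid, so that in-between pixels average 2 or 4 coarse neighbors.

```cpp
#include <vector>

// Disparity field as two planes (dx, dy), w*h values each, row-major.
struct Field {
    int w, h;
    std::vector<float> dx, dy;
};

// Upscale a coarse field by 2 in each direction, doubling the vector
// magnitudes to match the finer grid. Even coordinates copy a coarse value;
// the remaining pixels average 2 or 4 coarse neighbors (bilinear).
Field upscale2(const Field& c) {
    Field f{2 * c.w, 2 * c.h,
            std::vector<float>(4 * c.w * c.h), std::vector<float>(4 * c.w * c.h)};
    auto at = [&](const std::vector<float>& p, int x, int y) {
        if (x >= c.w) x = c.w - 1;            // clamp at the borders
        if (y >= c.h) y = c.h - 1;
        return p[y * c.w + x];
    };
    for (int y = 0; y < f.h; ++y)
        for (int x = 0; x < f.w; ++x) {
            int cx = x / 2, cy = y / 2;
            float wx = (x % 2) ? 0.5f : 0.0f; // halfway between coarse samples
            float wy = (y % 2) ? 0.5f : 0.0f;
            auto lerp2 = [&](const std::vector<float>& p) {
                float a = at(p, cx, cy),     b = at(p, cx + 1, cy);
                float d = at(p, cx, cy + 1), e = at(p, cx + 1, cy + 1);
                float top = a + wx * (b - a), bot = d + wx * (e - d);
                return top + wy * (bot - top);
            };
            int i = y * f.w + x;
            f.dx[i] = 2.0f * lerp2(c.dx);     // scale vectors to the new resolution
            f.dy[i] = 2.0f * lerp2(c.dy);
        }
    return f;
}
```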
6.4.5. Disparity Maps
The final result of this hierarchical processing scheme is a dense disparity map. This map can be seen as an $(M_0 \times N_0)$ array, where each element is a data structure composed of 3 values:

• Disparity value along the x axis.
• Disparity value along the y axis.
• Similarity measure value of the corresponding pair of pixels.

Since there is a direct relation between the obtained correlation values and the achieved matching performance, the similarity measure sub-array can be regarded as a confidence map of the final result. Therefore, it can be used to select the pixel coordinates corresponding to the best match of the whole process.

6.5. Experimental Results

The comparative analysis presented in this research was based on a software implementation of the described algorithms using the object-oriented language C++, running on Linux and Windows NT workstations. The following subsections describe the experiment layout and present the experimental results achieved.

6.5.1. Experiment Layout
In the performed comparative analysis, two image pairs, representing a scene taken at planet Mars and an aerial view of the Pentagon, have been used (see figures 6.6 and 6.7). These scenes are considered good examples of highly textured pictures, well suited for the evaluation of area-based matching algorithms. In the several computations carried out in this research, two main aspects were used to assess the algorithms: matching error and computational load. To obtain a fair comparison of the obtained disparity maps, the final registration error of each similarity function has been used to assess the
result of the several algorithms. This common measure was estimated by computing, for each pixel of the right image, the sum of the squared differences $e_{xy}$ between all pixels belonging to a rectangular window of size $(K \times L)$ of the right image, $f(x,y)$, and all pixels belonging to the corresponding window of the left image, $g(x,y)$, defined with the disparity vector $(d_x, d_y)$:

$$e_{xy}=\sum_{i=-\frac{K-1}{2}}^{+\frac{K-1}{2}}\;\sum_{j=-\frac{L-1}{2}}^{+\frac{L-1}{2}}\left[f(x+i,\,y+j)-g(x+i+d_{x},\,y+j+d_{y})\right]^{2} \qquad (6.11)$$

By evaluating the square root of this sum and dividing it by the area of the considered window, a value is obtained which quantifies the resultant matching error in the pixel domain (see eq. 6.12). The matrix $\mathbf{E}(x,y)$ composed of all these $E(x,y)$ values is called the error map (see eq. 6.13). Moreover, by accumulating all these $E(x,y)$ values and normalizing the result by the total image area, the value $\bar{E}$ which best characterizes the performance of a given algorithm is obtained (see eq. 6.14).
Fig. 6.6. Left and right images of Sojourner, taken from the Pathfinder lander camera at planet Mars.
Fig. 6.7. Left and right images of an aerial view of the Pentagon.
$$E(x,y)=\frac{\sqrt{e_{xy}}}{K\times L} \qquad (6.12)$$

$$\mathbf{E}(x,y)=\left\{E(x,y)\;;\;0\le x<M\;;\;0\le y<N\right\} \qquad (6.13)$$

$$\bar{E}=\frac{\sum_{x=0}^{M-1}\sum_{y=0}^{N-1}E(x,y)}{M\times N} \qquad (6.14)$$

The $\bar{E}$ values of the several algorithms were used in the comparative charts presented in the following subsections.

In what concerns the computational load, the implemented program was developed to provide a statistical study of all arithmetic operations performed along the disparity map estimation, namely multiplications, additions and other similarity-measure-specific functions, such as square roots, absolute values and integer divisions. Furthermore, the performed analysis also focused on the study of different aspects related to matching estimation using a pyramidal structure. The complete set of similarity functions presented in table 6.1 was used in this analysis, making it possible to evaluate the specific characteristics of each function in terms of final disparity error and computational load. The parameters considered were:

• Pyramid depth, i.e., the number of layers used in the hierarchical scheme;
• Pixel mobility, defining the maximum allowed value of each disparity vector, and controlled by adjusting the search window width parameter;
• Reference window width;
• Computational load distribution among the several stages of the system.
The following subsections present the set of comparative results achieved using the Mars stereo image pair. These results are entirely similar to those obtained using the other test images with respect to all the aspects described above.

6.5.2. Disparity Maps
Figure 6.8 presents a graphical representation of the components of the disparity vectors along the x and y axes. These matrices were obtained by applying the ZNCC similarity function, with a 2-layered pyramidal structure, a 64-pixel-wide search window and a 13-pixel-wide reference window, to the Mars stereo image pair. The observed gradual increase of the disparity vector amplitude in the y direction along the x axis conforms to the expected behavior. Since these vectors correspond to image points
farther from the camera system position, they present more significant coordinate differences. The representation shown in figure 6.9 presents these matrices in a more intuitive way. Figure 6.10 presents the correlation map corresponding to this configuration and the error map obtained from eq. 6.13. In this figure it is possible to distinguish image areas with significantly higher values of the disparity error, namely at the borders of the graph. These higher values are usually due to non-overlapping regions of the image pair. Some of the peaks found in the middle of the error map are often caused by matching mistakes or occluded objects. Several solutions have been proposed in the literature to eliminate these matching mistakes.^{11} The approach considered in the present research was based on the previously described smoothness constraint of the disparity field. It consists of performing a median filtering stage, before applying the interpolation step, on the disparity field obtained at each level of the pyramidal scheme. The purpose of the filter is to remove abrupt maximum and minimum peaks from the field. Unfortunately, the achieved results have shown effective improvements of only 5% on the final disparity error. Therefore, it was decided to disable this intermediate block of the processing scheme in the subsequent study of the required computational load presented in the next subsections.
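The median filtering stage mentioned above admits a very small implementation; the sketch below (our addition, not the authors' code) applies a 3 × 3 median to one disparity plane, leaving the borders untouched for brevity.

```cpp
#include <algorithm>
#include <vector>

// 3x3 median filter over a disparity plane (width w, height h, row-major),
// used to suppress isolated peaks before the interpolation step.
std::vector<float> median3x3(const std::vector<float>& d, int w, int h) {
    std::vector<float> out = d;
    for (int y = 1; y < h - 1; ++y)
        for (int x = 1; x < w - 1; ++x) {
            float win[9];
            int k = 0;
            for (int j = -1; j <= 1; ++j)
                for (int i = -1; i <= 1; ++i)
                    win[k++] = d[(y + j) * w + (x + i)];
            std::nth_element(win, win + 4, win + 9);  // median of the 9 values
            out[y * w + x] = win[4];
        }
    return out;
}
```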
6.5.3. Disparity Error
Fig. 6.8. Disparity values along the x and y axes.

Fig. 6.9. Disparity matrix.

Fig. 6.10. Correlation and error maps.

Fig. 6.11. Influence of the allowed pixel mobility on the final disparity error.

Fig. 6.12. Influence of the pyramid depth on the final disparity error.

Fig. 6.13. Influence of the reference window width on the final disparity error.

Figure 6.11 shows the variation of the disparity error with the allowed pixel mobility. In this and in the following charts, the measure Full-Mobility (FM) designates the largest allowed value of any disparity vector of the estimated map. This value corresponds to the limit situation that occurs when there is a match between the lower-left pixel of one image and the upper-right pixel of the other image. In a non-hierarchical scheme this limit situation would give rise to the usage of a search area four times greater than the original one. For the majority of the considered configurations it is always possible to find the one corresponding to the best match. In the example given above, it corresponds to the situation where the maximum allowed disparity value is FM/8, which corresponds to a 32-pixel-wide search window. The increase of the error with smaller windows was expected, since the probability of finding the correct matching pixel in the other image decreases. The observed increase of the disparity error when larger search windows were used is probably due to an increase in the ambiguity of the image features.

Figure 6.12 represents the variation of the disparity error with the number of levels used in the pyramidal structure. The presented values evidence better results with fewer hierarchical levels, as a consequence of distortions and losses of essential features required for the matching of image regions when too-coarse resolutions are used.

Figure 6.13 describes the influence of the reference window width on the disparity error. The obtained results evidence the existence of an optimal value for this parameter, corresponding to the best configuration. For the example given above, reference windows 11 or 13 pixels wide have shown to be more favorable. For smaller windows, the increase of the error value can be justified by a substantial increase of the matching ambiguity, since it becomes very difficult to distinguish the several image features when too-small reference windows (such as 3 × 3 windows) are used. The increase of the disparity error when larger windows are used
can be explained by the increasing influence of the differences found in the image pair due to the different points of view of the acquisition camera system, making it more difficult to find the correct match.

The previous charts also provide useful information to compare the considered similarity functions. From these charts, it is possible to conclude that, by using convenient values for the search window width and the reference window width, the final results can be very similar. However, better results can be obtained with the SSD, NSSD and SAD similarity functions. The simple cross-correlation function (SCC) provided the worst performance in terms of disparity error. Consequently, its disparity error was not considered in some of the comparisons presented above.

6.5.4. Computational Load
Figure 6.14 presents a statistical comparison of the required number of multiplications, additions and other operations, for different configurations of the maximum allowed pixel mobility parameter. In these and in the following charts, the general designation Other corresponds to the computation of:

• Other(a, b) = a/√b in the NCC, ZNCC, NZSSD and NSSD similarity functions;
• Other(a, b) = a/b in the MOR and LSSD similarity functions;
• Other(a) = |a| in the SAD, ZSAD and LSAD similarity functions.
These charts clearly evidence that the increase of the mobility factor (reflected in a corresponding increase of the search window area) is responsible for a significant increase in the number of arithmetic operations performed along the estimation process. As an example, it is possible to verify that a search
window corresponding to FM/1 implies a computational load 8000 times higher than a search window corresponding to FM/32.

Fig. 6.14. Influence of the allowed pixel mobility on the number of arithmetic operations: (a) multiplications, (b) additions, (c) other.

Figure 6.15 shows a statistical comparison of the number of arithmetic operations required to estimate the dense disparity field using 0 (original resolution), 1, 2 and 3 hierarchical levels. These results evidence the significant advantages of using a hierarchical approach. They allow us to conclude that the computational load of a 3-layered scheme can be 500 times lower than that required by a plain-structure approach. The exceptionally low number of multiplications required by the SAD similarity function was expected, because this operation is only performed in the filter and interpolation stages of this registration algorithm.

Fig. 6.15. Influence of the pyramid depth on the number of arithmetic operations: (a) multiplications, (b) additions, (c) other.

Figure 6.16 presents the variation of the computational load with the width of the reference window, emphasizing the general increase of
the number of arithmetic operations performed in the matching estimation stage when the reference window is enlarged.

Fig. 6.16. Variation of the number of operations with the width of the reference window: (a) multiplications, (b) additions, (c) other.

Finally, figure 6.17 presents the distribution of the global computational load among the three main stages of the system: similarity computation, Gaussian filtering and disparity map interpolation. These charts clearly show that the similarity measure computation is responsible for the major part of the operations performed in the estimation process. Moreover, the overhead of the several auxiliary stages required by a pyramidal processing scheme has an almost insignificant weight in the overall computational load. Therefore, the results presented in this subsection can be regarded as a very important comparison basis for selecting the most convenient similarity function and the most suitable set of parameters of the pyramidal processing scheme to estimate the dense disparity map of a given stereo image pair. Likewise, they can also be regarded as a primary basis for deciding on the critical tradeoff between computational load saving and final disparity error minimization.
Fig. 6.17. Distribution of the multiply and add operations among the several stages of the system: (a) multiplications, (b) additions.
6.6. Conclusion

This research work has presented a comparative analysis of several area-based similarity measure functions, using a pyramidal resolution scheme. The similarity functions analyzed constitute a set of twelve different cross-correlation based matching algorithms. Among this set, it has been shown that better results can be obtained when zero-mean normalized similarity functions are used, such as ZNCC and MOR. Dissimilarity functions like SSD and SAD have also proved to give good results.

The presented research has also shown the influence of several parameters of this hierarchical scheme on the overall disparity error and computational load. Among this set of parameters, the number of layers, the search window width and the reference window width have proved to be significant for achieving a convenient tradeoff between the two referred factors. Moreover, some registration algorithms based on the computation of dissimilarity measures using the absolute-value operator (SAD, ZSAD and LSAD) have shown to be very cost effective in terms of the required computational resources. This fact is tied to the low number of multiplication operations they require, an operation usually associated with more demanding arithmetic units and longer processing times. As a consequence, recent implementations based on VLSI circuits or digital signal processors have adopted this set of registration algorithms.

Acknowledgements

The stereo images of Sojourner, taken at planet Mars by the Pathfinder lander, were obtained at: http://www.psc.edu/Mars/default.html
The stereo images of the Pentagon and Plants were obtained at: http://www.ius.cs.cmu.edu/idb/html/stereo

References

1. A. Redert, E. Hendriks, J. Biemond, "Correspondence Estimation in Image Pairs", IEEE Signal Processing Magazine, May 1999, pp. 29-45.
2. A. Arsenio, J. S. Marques, "Performance Analysis and Characterization of Matching Algorithms", Proc. of the 5th International Symposium on Intelligent Robotic Systems, Stockholm, Sweden, July 1997.
3. M. G. Strintzis, S. Malassiotis, "Object-Based Coding of Stereoscopic and 3D Image Sequences", IEEE Signal Processing Magazine, May 1999, pp. 14-28.
4. P. Aschwanden, W. Guggenbühl, "Experimental Results from a Comparative Study on Correlation-Type Registration Algorithms", in Förstner and Ruwiedel, eds., Robust Computer Vision, pp. 268-282, Wichmann, 1992.
5. E. Grossmann, J. Santos-Victor, "Performance Evaluation of Optical Flow Estimators: Assessment of a New Affine Flow Method", VisLab-TR 07/97, Robotics and Autonomous Systems, Elsevier, July 1997.
6. M. O'Neill, M. Denos, "Automated System for Coarse-to-Fine Pyramidal Area Correlation Stereo Matching", Image and Vision Computing, vol. 14, 1996, pp. 225-236.
7. O. Faugeras et al., "Quantitative and Qualitative Comparison of some Area and Feature-Based Stereo Algorithms", in Förstner and Ruwiedel, eds., Robust Computer Vision, pp. 1-26, Wichmann, 1992.
8. H. Hseu, A. Bhalerao, R. Wilson, "Image Matching Based on the Co-occurrence Matrix", 1999.
9. R. Gonzalez, R. Woods, "Digital Image Processing", Addison-Wesley, 1993.
10. D. Ballard, C. Brown, "Computer Vision", Prentice-Hall, Inc., 1982.
11. C. Sun, "Multi-Resolution Rectangular Subregioning Stereo Matching Using Fast Correlation and Dynamic Programming Techniques", August 1998.
12. R. Schalkoff, "Digital Image Processing and Computer Vision", John Wiley & Sons, Inc., 1989.
CHAPTER 7

Performance Evaluation of Medical Image Processing Algorithms
James C. Gee Department of Radiology, University of Pennsylvania 3400 Spruce Street, Philadelphia, PA 19104, USA E-mail: [email protected]
Modern imaging techniques in medicine have revolutionized the study of anatomy and physiology in man. A central factor in the success and increasingly widespread application of imaging-based approaches in clinical and basic research has been the emergence of sophisticated computational methods for extracting salient information from image data. The utility of image processing has prompted the development of numerous algorithms for medical data, but these have largely remained research tools and few have been incorporated into a clinical workflow. A primary cause of this poor track record is the lack of validation of these methods. A workshop was held at the Image Processing Conference of the SPIE Medical Imaging 2000 Symposium to discuss and stimulate developments in performance characterization research for medical image processing algorithms. This report presents highlights from the workshop presentations and from the panel discussion with the audience.

7.1. Introduction
At the SPIE Medical Imaging 2000 Symposium, the Image Processing conference highlighted the subject of performance validation by featuring Robert Haralick in the keynote presentation and through the organization of a workshop that expanded the discussion introduced in the keynote lecture. The workshop panel included: Laurence Clarke, National Cancer Institute; Michael Fitzpatrick, Vanderbilt University; James Gee (Moderator), University of Pennsylvania; Robert Haralick (Co-moderator), University of Washington; David Haynor, University of Washington; Visvanathan Ramesh, Siemens Corporate Research; and Max Viergever,
Image Sciences Institute, University Hospital Utrecht. The workshop took the format of presentations by the panel members, followed by a moderated discussion with the audience. This report presents highlights from both the presentations and the panel discussion. Additional written contributions^{1,4,5,6} from several of the panelists can be found in the proceedings for the symposium, which together with this summary provide a concise record of the workshop proceedings.
7.2. Presentations

7.2.1. New NCI Initiatives in Computer-Aided Diagnosis
A major barrier to translating image processing research into application has been the lack of common datasets with which to evaluate proposed methods. Dr. Laurence Clarke, branch chief of Imaging Technology Development at the Biomedical Imaging Program of the National Cancer Institute (NCI), described several federally sponsored initiatives on image databases which aim to address this need for data standards and resources. One example is the National Library of Medicine/National Science Foundation (NSF) Visible Human Project,^a which makes available CT, MRI and cryosection images of a man and a woman, and is developing a supplementary toolkit of validation methods and materials that will facilitate the dataset's use in evaluating image registration and segmentation algorithms. Another example is the Human Brain Project (HBP),^b which receives support from the National Institutes of Health, the National Science Foundation, the National Aeronautics and Space Administration, and the Department of Energy. Toward the HBP's aim of creating neuroinformatics tools to advance neuroscience research, the interoperability of the developed software and databases has become a priority of the program.

Dr. Clarke then discussed a new NCI initiative to develop a spiral X-ray CT database for lung cancer screening in which the definition of standards will be considered from the outset of the project. A primary application of the database is to support validation of methods for computer-aided diagnosis (CAD). The standards and database will be established by a consortium of investigators under a U01 cooperative agreement. Details of the agreement as well as the rationale and specific goals of the initiative are described in Dr. Clarke's article^1 in the proceedings for the symposium.

Dr. Clarke is concerned with the scale of image processing requirements for the anticipated growth of applications such as microPET and CT for small animal work, and urged the professional societies to assume leadership on the issue of algorithm evaluation. For example, he would like to see SPIE go further on the issue, to cover the several steps along the pathway to clinical validation and to alert young investigators to the problems that many encounter.

^a http://www.nlm.nih.gov/research/visible/visible_human.html
^b http://www.nimh.nih.gov/neuroinformatics/index.cfm

7.2.2. Performance Characterization of Image and Video Analysis Systems at Siemens Corporate Research
The number of commercial products that employ image analysis as part of their solution to real-world problems continues to grow. The methodology by which these commercial systems are validated is therefore especially relevant as the medical imaging research community begins to more rigorously address the issue of performance evaluation of its work. Dr. Visvanathan Ramesh described how Siemens Corporate Research (SCR) is tackling the characterization of algorithms for image analysis systems in terms of their effect on total system performance. A systems engineering methodology is applied that comprises component identification and application domain characterization. In the component identification step, the deterministic and stochastic behavior of each algorithmic module of the system is determined by studying its response to idealized models of the input. This approach effectively considers each module as an estimator, and the challenge in characterizing the system as a whole is in deriving the joint distribution function that models the combined behavior of its modules. Application domain characterization is an additional step in which task-specific constraints on the input data are learned. The constraints are expressed as probability distributions on the relevant algorithmic or system parameters, and amount in general to restricting the space of input images for the application. Once the preceding models are available for a system, its expected performance can be formally evaluated and its system parameters optimized.^6

At SCR, white and black box analyses are both applied by first developing comprehensive models of the application domain. The models are then used to guide system design as well as to characterize its performance.
White box analysis—based on tools for propagating distributions and covariances and for modeling perturbations—has been used at SCR to develop systems for people detection and tracking, face recognition, and recovering structure from motion. By additionally obtaining a quantitative measure of the uncertainty in the output of these systems, SCR developers can gain insight into the performance of their designs. Examples of SCR applications for which black box testing was performed to optimize parameter settings include MRI cardiac boundary segmentation for quantification of ejection fraction, and motion stabilization in X-ray coronary angiography for stenosis tracking. For the cardiac application, values for 37 system parameters were evaluated by measuring the Hausdorff distance between the computed contours and expert delineations on 224 images collected from 11 patients. The analysis sought to determine the set of values which consistently produced small errors, so that user intervention would be minimized in practical settings. In black box tests for the angiography application, each set of 14 parameter values was run on 10 stenosis studies and evaluated using the Euclidean distance between the estimated and expert-defined locations for each stenosis. The object of the analysis was to identify a value set that stabilized the stenosis to within 15 pixels—the amount tolerated during clinical interpretation—for all of the studies.

In his summary, Dr. Ramesh reiterated the importance of component identification in the characterization of total system performance, and suggested that the research community should devote more attention to white-box analysis in this regard. He also asserted that improved application domain modeling must be a priority in order to allow re-use of algorithmic modules across applications. Dr. Ramesh concluded by noting that a hallmark of successful translations of research systems to imaging products is the achievement of both speed and accuracy in final performance, and that such performance standards can only be met through total system characterizations. Further information can be found in Dr. Ramesh's article^6 in the proceedings for the symposium.
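As background for readers unfamiliar with the contour metric mentioned above (our sketch, not SCR code): the Hausdorff distance between two point-sampled contours can be computed naively as below. The Pt type and the sampling of contours into point lists are assumptions.

```cpp
#include <algorithm>
#include <cmath>
#include <limits>
#include <vector>

struct Pt { float x, y; };

// Directed Hausdorff distance h(A, B): the worst-case distance from a point
// of contour A to its nearest point on contour B.
float directed_hausdorff(const std::vector<Pt>& A, const std::vector<Pt>& B) {
    float worst = 0.0f;
    for (const Pt& a : A) {
        float best = std::numeric_limits<float>::max();
        for (const Pt& b : B)
            best = std::min(best, std::hypot(a.x - b.x, a.y - b.y));
        worst = std::max(worst, best);
    }
    return worst;
}

// Symmetric Hausdorff distance H(A, B) = max(h(A, B), h(B, A)).
float hausdorff(const std::vector<Pt>& A, const std::vector<Pt>& B) {
    return std::max(directed_hausdorff(A, B), directed_hausdorff(B, A));
}
```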
7.2.3. Validating Registration Algorithms: A Case Study
One of the most widely cited studies on performance validation in medical image processing is the evaluation of registration methods conducted by Dr. Michael Fitzpatrick and his student Jay West at Vanderbilt University,^7
and he shared in his presentation some of the lessons learned from that study. The project started with a definition of the registration problem, which was taken to be the determination of corresponding points in two different views of an object. Registration is to be distinguished from fusion, which is defined as the integration of registered images. Fusion may require reslicing, resizing, interpolation, or the generation of new images; whereas image registration involves only the calculation of a spatial transformation. A specific task was considered for validation: the rigid registration of brain images (CT-MR and PET-MR) from the same patient. The images were of patients afflicted with brain tumors because the clinical collaborator, who provided the data, happened to be a tumor surgeon. The ground truth was established using a prospective method for registration, in which fiducials were located in the images and then aligned. The project received funding^c from the NIH to evaluate registration algorithms that only utilized naturally occurring information in the images. Sixteen such methods for retrospective image registration were initially studied through the participation of 18 investigators at 12 sites located in 5 countries. The test dataset contained CT, T1, T2, PD, and PET images of 9 patients, totaling 61 image volumes. Over a thousand registration results were evaluated in the project. Dr. Fitzpatrick enumerated several problems that were encountered during the conduct of the validation study, and these are described below.

Problem 1: Blinding the participants. For scientific rigor, the participants were blinded during the evaluation. This was accomplished by distributing the test data to the investigators as opposed to gathering their algorithms. Data distribution was possible because the fiducial markers could be removed from the images. In addition, image registration is not something that humans do well, so having access to the data did not translate to any advantage for the investigators. Moreover, collecting and running the algorithms at Vanderbilt would have been extremely difficult: the set-up of each method may require skill and some techniques involve user interaction.

Problem 2: Communication. In order to streamline the communication between the participating sites and Vanderbilt, the following protocol was adopted:

• Remote site sends participation form
• Vanderbilt sends password
• Remote site obtains images by FTP
• Remote site performs registrations
• Remote site emails transformations
• Vanderbilt evaluates transformations
• Vanderbilt emails error summary

^c "Evaluation of retrospective image registration," NIH R01 NS33926, J. M. Fitzpatrick, P.I.
Problem 3: Transformation format. The problem of specifying without ambiguity the rotation and translation parameters of a rigid transformation is not trivial. The reference space for the transformations needs to be agreed upon, as do the location of the image origin, the directions of the axes, the rotation sense, and the length units. A transformation template was therefore devised that provided the coordinates of 8 vertices of a parallelepiped, as well as fields to be filled in with the coordinates of the vertices' corresponding positions under the submitted transformation.

Problem 4: Measuring performance. The methods were evaluated by measuring their registration error for 10 evenly distributed targets across the brain volume that had been selected for neurological interest by a neuroradiologist. An important advantage of this approach is that it does not involve fusion and consequently avoids the confounding issues introduced by operations required by fusion, such as reslicing or rendering.

Problem 5: Validating the standard itself. The prospective registration method used to define the ground truth is itself imperfect because of errors in localizing the fiducial markers. Nevertheless, Dr. Fitzpatrick was able to determine the theoretical relationship between the fiducial and target registration errors, and used this result to validate the gold standard obtained with his prospective method.

Problem 6: Reporting results. The median error was reported to prevent outliers from skewing the evaluation; however, large outliers, when present, were also reported. Furthermore, the gold-standard TRE (target registration error) was published so that a benchmark would be available. These measures were considered more informative than the mean and standard deviation for assessing the practical efficacy of the methods. Because there was surprisingly little variation in the errors from target to target, the results reported for each patient were defined over all of the targets. Dr. Fitzpatrick cautioned that
the issue of authorship had to be considered very carefully for large projects in which many investigators are involved.

Problem 7: Submission errors. Despite the mechanisms put into place to help reduce errors in the submission of registration results, mistakes inevitably occurred: one site, for example, had all the angles negated; two sites submitted transformation tables that were incorrect; and another site introduced an error during manual entry. These errors were addressed by double reporting, along with the provision of an explanation for each case.

Problem 8: What hypotheses should we test? Dr. Fitzpatrick suggested that the statement of a hypothesis implicitly introduces bias into a study, and given the comparative nature of the evaluation that was planned, such a bias would have been detrimental to his ability to carry out the experiment. His solution therefore was to conduct the study without setting out to test any prospective hypotheses. Another difficulty with hypothesis testing in validation studies is the low statistical power that results when a large number of algorithms is examined, as was the case in the registration evaluation. Two solutions were pursued: the use of descriptive statistics to summarize the results, and the use of experimental designs based on data pooled from similar methods.

Problem 9: Complexity. In addition to those already described, a host of other problems arose from the complex interactions that are inherent in any large undertaking involving sophisticated methods and numerous participants. To help manage this complexity, Dr. Fitzpatrick emphasized the requirement of detailed documentation and the use of ASCII in header files and transformation tables. Only one training dataset was available and he felt more should have been provided. A sham site was set up at Vanderbilt and this proved to be a valuable testbed for discovering and correcting errors. Nevertheless, in a little less than 2 years, the project group had to respond to over 300 e-mail messages having only to do with problem solving. Patience from all parties is therefore key to the viability of any such project.

Problem 10: Is it worthy of funding? The last problem identified by Dr. Fitzpatrick is the lack of consensus among funding sources on whether algorithm evaluation is worthy of support. The specification of hypotheses, as he argued earlier, may not only be inappropriate but can also handicap such studies. This presents a significant barrier to funding support from the
NIH, which has traditionally promoted research that is hypothesis-driven.

Dr. Fitzpatrick summarized the positive as well as the negative aspects of the approach used in the Vanderbilt study. Specifically, the decision to coordinate the evaluation over the Internet was considered to be a good one, and so was the design to blind the participants. However, to encourage participation and to build confidence in the project, the investigators were unblinded after the study, which unfortunately introduced another set of potential problems. Other positive elements of the evaluation approach were the establishment of a sham site, the involvement of the original investigators in publication of the results, and the double reporting of errors. The fact that the study avoided hypothesis testing was deemed both an advantage (scientifically correct) and a disadvantage (it may seem nonscientific).

The motivation for dealing with the myriad problems associated with studies of performance evaluation is that they yield enormous benefits. The Vanderbilt project established the state of the art (circa 1997) in rigid registration of multi-modality studies; facilitated a paradigm shift from surface- to voxel-based techniques; tempered faith in the methods (some showed surprisingly large errors); leveled the playing field and opened competition to "beginners"; and provided a benchmark for publications. Dr. Fitzpatrick concluded with the encouraging message that performance evaluation works, its problems require attention to detail, it benefits the field, and, if one has the time (and the funding), validation is worth the effort. Further information about the Vanderbilt study can be found in Fitzpatrick and West^2 and West et al.^7

7.2.4. Performance Evaluation of Image Processing Algorithms in Medicine: A Clinical Perspective
Dr. David Haynor, a neuroradiologist at the University of Washington with a longstanding interest in image analysis research, described some of the criteria that are important for the clinical use of an algorithm and how they differ from certain elements emphasized in the computer vision community. In motivating the need for special attention to be devoted to performance evaluation, Dr. Haynor noted that the genesis of an idea and its raw implementation in code constitute only a very small fraction of the effort required to translate an algorithm into a clinical workflow. The rate-limiting step is validation of the algorithm. A key criterion of any final evaluation of a method is that it must
model the clinical setting. The emphasis ordinarily placed on automation in clinical applications is due to the fact that the final clinical utility of an algorithm is partly related to the amount of time and human effort that is required to obtain a satisfactory result. Consequently, the evaluation has to reflect true clinical use. This requirement points to a weakness in the kinds of validation studies reported by Dr. Fitzpatrick^7 (see Sec. 7.2.3), where experts and not ordinary users operated the algorithms and, even then, mistakes were made by the operators.

Another important validation criterion is that the experimental noise levels and structure should be realistic. The methods of analysis described by Dr. Ramesh involve studying the effects of small perturbations using linearizations around particular operating points (see Sec. 7.2.2). These techniques of infinitesimal analysis are generally insufficient in themselves for a robust clinical evaluation. Dr. Haynor sought to remind the audience of the methods that the statistics community has developed over the past 10-15 years for analyzing highly complicated statistical models. Although it is usually straightforward to derive the likelihood model of data given the ground truth, going "backwards"—which is the parameter estimation problem—is frequently much more difficult. The solution that has been arrived at in the statistics community is to conduct Bayesian sampling from the posterior distribution of the ground truth given the data.

From a clinical perspective, some examples can be given of what would constitute a useful performance characterization. For instance, in applications of detection, it is important to know the false positive and false negative rates: in CAD, where potentially many suspicious areas are highlighted by an application, one would like to know how trustworthy a witness the program is (what is its expected false positive rate) in order to calibrate one's interpretation of the results. For tasks where humans disagree but are the only source for ground truth, performance of an algorithm should essentially be within the human range. In other words, a kind of Turing test is relevant: the final performance of the algorithm should be indistinguishable from the performance of human observers.

In addition to adopting documentation of validations for clinical datasets or methods as a Good Manufacturing Practice, a valuable resource would be the availability of defined procedures for algorithm validation on standardized datasets to facilitate comparison of results. The experimental set-up
used in the Vanderbilt study, for example, has the desirable feature that cheating, even if possible, is more effort than it's worth.

Dr. Haynor raised the crucial point that in the clinical arena, the emphasis in performance characterization changes from error minimization to error prediction and control. A clinician will be more interested in controlling the size of an error than necessarily in minimizing it. It would be extremely important, for example, to avoid outliers when one moves from an image-based analysis to an essentially blind stereotactic procedure. Algorithms should not only be characterized with some notion of error bounds and confidence intervals (estimated using Bayesian techniques, for example), they should also possess the capability of directing the user to areas of the output where there is doubt; for example, if tracing contours, to point to where the contour is particularly uncertain. This becomes an issue when datasets become too large for humans to review quickly. For the same reason, methods for rapid visual evaluation of results would be very helpful. The whole emphasis of these features and capabilities is to prevent disastrous errors. As has been emphasized by others, one can make the argument that the amount of human effort required to bring the results with certainty into the realm of clinical utility is probably a better error measure than the absolute error value in many cases. Finally, Dr. Haynor suggested that from a practical viewpoint the preferred algorithms would be those that could accept "gentle advice" from a user if the initial result is unsatisfactory. Further information can be found in Dr. Haynor's article^5 in the proceedings for the symposium.

7.2.5. Performance Evaluation: Points for Discussion
Dr. Max Viergever, who directs the Image Sciences Institute at the University Hospital Utrecht, provided another expert perspective on performance validation from the academic research community. In particular, he raised a number of points for consideration during the discussion session, and these are summarized below. Dr. Viergever discerned a common theme in Dr. Haynor's requirement that an evaluation must reflect clinical use and Dr. Ramesh's emphasis on application domain modeling in algorithm development and validation. Dr. Viergever went further with this theme by asserting that identification
of the task is paramount to the relevance of any evaluation. Such studies are considered meaningful only when they occur within the context of a particular task, such as quantitation, presentation, or decision making. Moreover, it is the task specification that introduces the particular issues—for example, type of imaging modality or performance requirements—which must be addressed by the algorithms. This approach takes the position that in medical imaging the assumptions underlying an analysis (model verification) are generally more critical for the evaluation than the exact correctness of its algorithmic implementation (code verification).

Dr. Viergever advocated the use of simulated or synthetic data for algorithm evaluation, but only to check the consistency of a method, because the degree of realism in simulations is usually too low for more extensive validation. A better approach is to use data acquired from physical phantoms or cadavers, but these have their limitations as well. Phantom data are in general not sufficiently realistic to be representative of actual clinical data. There are also a variety of tasks for which phantoms are extremely difficult to create and cadavers are inappropriate. Dr. Viergever believes that rigorous evaluation ultimately requires the use of patient data. Ground truth must therefore be provided, and this usually entails substantial work on the part of a number of experts. Since the ground truth is defined by humans, there will be disagreement, and the process will also be expensive. Nevertheless, expert labeling represents the state of the art in establishing ground truth, and funding for this critical activity should be a high priority.

Given the scale of effort that would be required for clinically based evaluations of every new type of algorithm, one alternative that has been proposed is to rely on reference databases. Dr. Viergever suggested that these databases will be inadequate because their content would be appropriate mainly within the specific context of the original problems for which they were designed. As an example, he pointed to the Vanderbilt dataset, which is useful for evaluating a particular class of registration methods on a restricted set of imaging studies acquired from a specific part of human anatomy.
7.3. Panel Discussion^d

^d Panelists are identified by their initials, which are shown in boldface; questions and statements from the audience are similarly highlighted in boldface.

Has there been consideration within the medical imaging community of the operational research and simulation work developed for tactical aircraft design at the defense department? DH responded that, in the context of understanding medical errors, there has been a great deal of interest in simulations of the quality control measures that have been used in aircraft manufacturing, but he was not aware of any work in which those techniques had been applied to algorithm evaluation.

Further comments were requested on the panelists' differences in viewpoint about the utility of reference databases. MV stated his presentation was intended to introduce some controversy in hopes of spurring discussion, but he nonetheless believes that it will be difficult to implement sufficient database coverage given the rapid rate at which new modalities and applications are being introduced. LC added that the NCI project will be developing a retrospective database, whereas algorithm validation would ideally be conducted using prospective data. He notes that one therefore has to distinguish between the different stages of evaluation. An analysis based on retrospective data, for example, allows a reference point to be established for an algorithm's performance. Another difficulty with databases is that the technology used to create them is continually evolving. The research task charged to the consortium responsible for the NCI database will thus need to evolve similarly, in order to accommodate new methodologies and modalities in terms of the generation and evaluation of the database.

RH cautioned that politics plays a significant role in performance evaluation because of the potential conflict of interest that arises when an investigator participates in a study whose outcome could possibly harm his or her source of funding. To see the potential impact of this political factor, RH first described how the availability of large-scale, suitably annotated databases for algorithm evaluation in speech recognition and in natural language processing has benefited those fields. Once methods were put to the appropriate kinds of testing, performance began to increase more rapidly than before. In contrast, the program in computer vision at DARPA was eventually
closed in part because the participants who were well funded argued exactly against conducting performance characterization and evaluations. A variety of objections were raised regarding the nature of the evaluation tasks and their domain, all of which had a degree of validity but, in RH's opinion, tended to make for weak arguments scientifically. RH asserts that it is the incremental increase in performance that moves a field forward, and that this performance increase is best realized when researchers are put in competition with one another on a suitable performance measure. The DARPA program failed because the lack of competition yielded rates of performance improvement that were too slow. RH concluded that in spite of its many limitations, performance evaluation is worthwhile because it promotes a more conscientious approach toward algorithm development and refinement that ultimately results in better techniques.

What were the considerations given to the training and test data in the Vanderbilt study? MF responded that only one training case was supplied, but only because data was too scarce to provide more. There are plans for a much larger training dataset in a new project, and this will be important for drawing commercial partners into the fold of a larger study. RH related an experience where his laboratory was funded to build a document image database, and how he prevailed against the wishes of various commercial organizations to have the database created to a level at which none of the existing systems could actually utilize all of its detail. The database, which was issued in 1995-6, will consequently be useful until 2005. The suggestion was made that it may be as difficult to settle on one, two or a handful of methods for "universal" validation as it is to define one, two, or a handful of universal databases.

Regarding the analysis approach that involves analytic methods, how are dynamical models handled? RV responded that techniques exist to model dynamical systems and have been applied in a video application at SCR. He added that an eventual goal is to study image analysis systems with feedback, where the feedback can include user input. RV reiterated the importance of advancing theoretical approaches to performance evaluation. DH concurred that when an analytic solution can be obtained that is valid over a representative range of noise and so on, it will be superior to any kind of numerical solution just in terms of the rapidity with which
one can tune parameters or examine the effects of certain perturbations. He noted, though, that when those techniques cannot be applied, posterior sampling methods will accommodate essentially arbitrarily complicated forward models; given sufficient patience and a way of reducing the data, these methods can be extremely useful.

RH raised a related issue with the question: "When you have a system with many internal parameters (e.g., appearance-based recognition systems), how many samples should you take?" If the number of samples is too small, the system essentially memorizes the data and there is no generalization. The situation is complicated by the fact that the evaluation involves two phases. In the training phase, an optimization is performed to determine the internal parameter settings of the system. The parameter values found to optimize the system will not be correct, however, because they are determined from a sample that is small compared to the population in which the system will be applied. Then, in the testing phase, another infinitesimal sample is taken from the population to carry out the evaluation. The goal therefore is to develop a system that is insensitive to deviations of the internal parameter values from the ideal, which requires that the optimum be broad: when the sample size is small, the optimum tends to be peaked, whereas when the sample size is large, it turns out to be broad. It is thus possible to conduct sensitivity studies to determine empirically whether the sample size is sufficient for the system to perform well in its generalization mode (a toy numerical sketch of such a study appears below). The effect of sample size was evident in recent studies involving the detection of microcalcifications and masses on mammograms, where system performance decreased as more test data were processed.
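RH's prescription of empirical sensitivity studies can be made concrete with a toy experiment. The sketch below is illustrative only and not drawn from the workshop: a one-parameter threshold classifier stands in for a "system with many internal parameters," its parameter is optimized on training samples of several sizes, and the fitted value is then perturbed to see how sharply performance on a large stand-in "population" falls off. All data, sizes, and names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample(n):
    # n cases per class; class 1 is shifted by +1 along a single feature.
    x = np.concatenate([rng.normal(0.0, 1.0, n), rng.normal(1.0, 1.0, n)])
    y = np.concatenate([np.zeros(n), np.ones(n)])
    return x, y

def accuracy(t, x, y):
    # Classify as class 1 whenever the feature exceeds the threshold t.
    return float(np.mean((x > t) == y))

def fit(x, y):
    # "Training phase": optimize the single internal parameter on the sample.
    grid = np.linspace(-2.0, 3.0, 501)
    return grid[int(np.argmax([accuracy(t, x, y) for t in grid]))]

x_pop, y_pop = sample(50_000)  # large sample standing in for the population

for n_train in (10, 100, 10_000):
    t_hat = fit(*sample(n_train))
    base = accuracy(t_hat, x_pop, y_pop)
    # Sensitivity study: perturb the fitted parameter away from its
    # optimized value and record the worst loss of population accuracy.
    worst = max(base - accuracy(t_hat + d, x_pop, y_pop)
                for d in (-0.3, -0.1, 0.1, 0.3))
    print(f"n_train={n_train:6d}  t_hat={t_hat:+.3f}  "
          f"accuracy={base:.3f}  worst drop under perturbation={worst:.3f}")
```

With very few training cases the fitted threshold wanders away from the ideal value and population accuracy suffers; as the training sample grows, the fitted value and its robustness to perturbation stabilize, which is the empirical signature of a sufficiently broad optimum.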
For the Vanderbilt study, which required the estimation of a small number of parameters (six, to be precise) on a database of 19 studies, was the test dataset considered sufficiently large?

MF could not comment on the adequacy of the test data but recognized that the single training case was insufficient. In response to a question about the task examined in the study, MF responded that it was to minimize the registration error over 10 positions chosen by a neuroradiologist on a very limited set of patients. MV suggested that the relevance of this task would depend on the specific requirements of the clinical application.
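The registration error MF describes, the discrepancy at clinician-chosen target positions between an algorithm's transform and the gold standard, is straightforward to compute once both transforms are in hand. A minimal sketch, assuming rigid transforms represented as 4x4 homogeneous matrices and using entirely hypothetical target positions (the actual Vanderbilt transforms and targets are not reproduced here):

```python
import numpy as np

def registration_errors(targets, T_est, T_gold):
    """Distance (mm) between each target mapped by the estimated
    transform and by the gold-standard transform (both 4x4 homogeneous)."""
    homo = np.c_[targets, np.ones(len(targets))]      # (N, 4) homogeneous
    diff = (homo @ T_est.T - homo @ T_gold.T)[:, :3]  # (N, 3) displacements
    return np.linalg.norm(diff, axis=1)

# Toy data: 10 target positions (mm), gold standard = identity,
# estimate off by a 0.5 mm translation and a 0.2 degree in-plane rotation.
rng = np.random.default_rng(1)
targets = rng.uniform(-80.0, 80.0, (10, 3))
T_gold = np.eye(4)
theta = np.deg2rad(0.2)
T_est = np.eye(4)
T_est[:2, :2] = [[np.cos(theta), -np.sin(theta)],
                 [np.sin(theta),  np.cos(theta)]]
T_est[:3, 3] = [0.5, 0.0, 0.0]

err = registration_errors(targets, T_est, T_gold)
print(f"mean error = {err.mean():.2f} mm, worst = {err.max():.2f} mm")
```

Reporting the per-target distribution (mean and worst case) rather than a single average echoes MV's point that whether a given error level is acceptable depends on the clinical application.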
Would not the sample size in the Vanderbilt study depend on the variance that one would like to report to the surgeons?

DH pointed out that there are other sources of error in stereotactic surgery (e.g., settling of the brain) besides those due to registration. A larger concern is that sporadic errors on the order of 6 mm may matter far more for differentiating algorithms, because they can incur a much higher cost in a Bayesian loss calculation than the difference between errors of 0.4 and 0.7 mm, or of 0.7 and 1 mm. The trouble is that this is a perfect example of the kind of error that is extremely difficult to estimate: one may see a single occurrence in a hundred or a thousand registrations. As a result, when one instance of such an error is observed for one algorithm and none for another, no conclusion at all can be drawn about which algorithm is superior. This example reveals real limitations of algorithm evaluation that could only be addressed by turning to huge numbers of datasets, which is impractical.
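DH's caution about rare failures can be quantified with elementary binomial arithmetic. The sketch below uses standard results (the "rule of three" upper bound and the binomial probability mass function); its sample sizes and failure rates are purely illustrative and are not figures from the study:

```python
import math

def upper_bound_after_zero(n, conf=0.95):
    # One-sided upper confidence bound on a failure rate after observing
    # 0 failures in n independent trials; approximately 3/n at 95%.
    return 1.0 - (1.0 - conf) ** (1.0 / n)

def binom_pmf(k, n, p):
    # Probability of exactly k failures in n independent trials.
    return math.comb(n, k) * p**k * (1.0 - p) ** (n - k)

n = 1000
print(f"0 failures in {n} trials: the true rate may still be as high as "
      f"{upper_bound_after_zero(n):.4f}")

# If two algorithms share the same true rate of gross failures (say one
# per thousand cases), observing 1 failure for A and 0 for B is expected
# to happen often and says nothing about which algorithm is superior:
p = 1.0 / n
print(f"P(exactly 1 failure in {n})  = {binom_pmf(1, n, p):.3f}")
print(f"P(exactly 0 failures in {n}) = {binom_pmf(0, n, p):.3f}")
```

Both probabilities come out near 0.37, so a one-versus-zero tally over a thousand registrations is statistically uninformative, which is precisely why only impractically large numbers of datasets could settle such comparisons.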
What about the establishment of a public repository to which validation data can be freely submitted and that is openly accessible?

LC emphasized that a consensus must be developed on database generation so that the resultant dataset is accepted by the community at large as representative. Equally important, the database must be validated in a standardized way so that it can be used to compare algorithms fairly. Most databases are constructed at a single site using the criteria that site considers important; there is in fact considerable disagreement on how such databases should be generated, and many investigators who apply for NIH funding with proposals that use or create a particular dataset for algorithm evaluation are critiqued on the design of their databases rather than on the algorithms themselves. A major goal of the NCI database is to eliminate this issue, so that algorithm performance can be the focus in determining whether a proposal should be funded. RH observed that most researchers regard the annotated data they have painstakingly produced in the course of their studies as one of their laboratory's most precious assets. There is a reluctance to share these data, because doing so would give other researchers the same opportunity to develop their own algorithms on material that cost the originating laboratory so much effort. This problem can in part be solved by funding sources stipulating
that all databases created as part of a study become available to the public rather than remaining the property of the investigator. DH commented that one should not underestimate the problem of heterogeneity: given data from two sites with different quality standards, it is not obvious how the data should be merged in a database or used to compare algorithms (should an algorithm agree more consistently with one dataset or with the other?). These difficulties help explain the appeal of running a specific data collection project under a consensus-based approach. DH again called attention to the importance of outliers in medicine and to the fact that such errors are also the most difficult to characterize statistically, precisely because they are so rare. The designers of databases should therefore strive to include examples of hard or unusual cases, as well as cases that can be predicted to yield failures for certain methods, although these must be validated in the same way as the average cases entered into the database. In closing the workshop, RH hoped the message was clear that the production of validation databases and the conduct of performance evaluation are important issues, and that when these issues are considered carefully, algorithm and experimental design will improve, thereby moving the field forward.

Acknowledgments

This paper was originally published in the proceedings of the SPIE Medical Imaging 2000 Symposium (J. C. Gee, "Performance evaluation of medical image processing algorithms," Medical Imaging 2000: Image Processing, K. M. Hanson, ed., Proc. SPIE Vol. 3979, 2000). The author is grateful to Larry Clarke, Mike Fitzpatrick, Bob Haralick, Dave Haynor, V. Ramesh, and Max Viergever for their invaluable contributions to the workshop; to Ken Hanson and Krista Fleming for facilitating its organization; and to SPIE for generously supporting the panelists' participation in the workshop.

References

1. L. P. Clarke, B. Y. Croft, and E. Staab, "New NCI initiatives in computer aided diagnosis," in Medical Imaging 2000: Image Display and Visualization, K. M. Hanson, ed., SPIE, Bellingham, pp. 370-373, 2000.
2. J. M. Fitzpatrick and J. West, "A blinded evaluation and comparison of image registration methods," in Empirical Evaluation Techniques in Computer
Vision, K. Bowyer and P. J. Phillips, eds., IEEE Computer Society, Los Alamitos, pp. 12-27, 2000.
3. J. C. Gee, "Performance evaluation of medical image processing algorithms," in Medical Imaging 2000: Image Processing, K. M. Hanson, ed., SPIE, Bellingham, pp. 19-27, 2000.
4. R. M. Haralick, "Validating image analysis algorithms," in Medical Imaging 2000: Image Processing, K. M. Hanson, ed., SPIE, Bellingham, pp. 2-16, 2000.
5. D. R. Haynor, "Performance evaluation of image processing algorithms in medicine: A clinical perspective," in Medical Imaging 2000: Image Processing, K. M. Hanson, ed., SPIE, Bellingham, 2000.
6. V. Ramesh, M.-P. Jolly, and M. Greiffenhagen, "Performance characterization of image and video analysis systems at Siemens Corporate Research," in Medical Imaging 2000: Image Processing, K. M. Hanson, ed., SPIE, Bellingham, pp. 28-37, 2000.
7. J. West, J. M. Fitzpatrick, et al., "Comparison and evaluation of retrospective intermodality brain image registration techniques," J. Comput. Assist. Tomogr. 21, pp. 554-566, 1997.
/ " / / ' h i s book provides comprehensive coverage of methods for /\Y\e empirical evaluation of computer vision techniques. The practical use of computer vision requires empirical evaluation to ensure that the overall system has a guaranteed performance. The book contains articles that cover the design of experiments for evaluation, range image segmentation, the evaluation of face recognition and diffusion methods, image matching using correlation methods, and the performance of medical image processing algorithms.