Multimedia Techniques for Device and Ambient Intelligence
Ernesto Damiani Jechang Jeong Editors
Multimedia Techniques for Device and Ambient Intelligence
Editors: Ernesto Damiani, Dipartimento di Tecnologie dell’Informazione, Università degli Studi di Milano, via Bramante 65, 26013 Crema, Italy
[email protected]
Jechang Jeong, Department of Electronics & Computer Engineering, Hanyang University, 17 Haengdang-dong, Seongdong-gu, Seoul, Korea
[email protected]
ISBN 978-0-387-88776-0 e-ISBN 978-0-387-88777-7 DOI 10.1007/978-0-387-88777-7 Springer Dordrecht Heidelberg London New York Library of Congress Control Number: 2009926476 © Springer Science+Business Media, LLC 2009 All rights reserved. This work may not be translated or copied in whole or in part without the written permission of the publisher (Springer Science+Business Media, LLC, 233 Spring Street, New York, NY 10013, USA), except for brief excerpts in connection with reviews or scholarly analysis. Use in connection with any form of information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed is forbidden. The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights.
Printed on acid-free paper Springer is part of Springer Science+Business Media (www.springer.com)
Preface
Ambient Intelligence used to be a vision of our future. Today, it is more like a central aspect of our life: we already live surrounded by a new generation of electronic devices that help us in most of our work and leisure endeavors. Ambient Intelligence technologies combine concepts of ubiquitous computing and intelligent systems to put humans at the center of technological development. Many young people, in particular, are fond of ambient intelligence and seem to have found an easy and natural way to exploit the information and knowledge provided by the network connecting these devices and by the devices themselves. In such a scenario, giving the devices the capability of extracting and processing multimedia information is crucial. Indeed, time-honored research areas such as video processing and image analysis are now experiencing a second youth, and play a crucial role in supporting the devices’ advanced multimedia capabilities.
Multimedia Techniques for Device and Ambient Intelligence (MTDAI) is an edited volume written by well-recognized international researchers; it includes, but is not limited to, extended chapter-style versions of the papers presented at the homonymous MTDAI seminar that we started in 2008 in the unique setting of Villa Braida at Mogliano Veneto, near Venice. The MTDAI seminar is intended to bring together, without the usual formalities of a conference, a number of top researchers from academia and industry interested in multimedia issues. MTDAI is based on short presentations and open discussions, fostering interdisciplinary collaboration and encouraging the exploration of new frontiers in the area of ambient and device intelligence.
After the seminar, some MTDAI authors were asked to revise and extend their contributions, taking into account the lively discussion and remarks made during the seminar. Also, a call for chapters was published, attracting some interesting proposals for additional chapters. A rigorous refereeing process was then carried out; the result is this book, presenting the state of the art and some recent research results in the field of image understanding and its applications to device and ambient intelligence. The book is divided into two parts: the first part discusses new low-level techniques for image and video understanding, while the second part presents a series of novel applications, focusing on multimedia-oriented knowledge management.
Putting together a book like this is always a team effort, and we gratefully acknowledge the hard work and dedication of many people. First of all, we appreciate the fundamental work of the MTDAI committee members, who accepted to handle the refereeing of the book chapters and contributed valuable comments and observations. We would also like to acknowledge the help, support, and patience of the Springer publishing team. But even more importantly, we wish to thank the authors who have contributed their best research work to this volume. We believe that, while fully attaining the rigorousness and originality one would expect from a scientific edited volume, their contributions retain much of the liveliness and appeal to nonspecialists that are a major feature of our MTDAI seminar.

Milan and Seoul, February 2009
Ernesto Damiani Jechang Jeong
Contents
Part I Low Level Approach for Image and Video Understanding 1
GOP Structure Conversion in Transcoding MPEG-2 to H.264/AVC . 3 Kangjun Lee, Gwanggil Jeon and Jechang Jeong 1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.2 GOP Structure Conversion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 1.2.1 MV Scaling in the Temporal Direction . . . . . . . . . . . . . . . . 5 1.2.2 Correlation Between the Current MB Mode and the Reference Region . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 1.3 Proposed Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 1.3.1 Adaptive Search Range Selection through the MV Linearity Test in the Temporal Direction . . . . . . . . . . . . . . 8 1.3.2 Adaptive Mode Decision Method Based on Region Information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 1.4 Simulation Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 1.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2 Simple Low Level Features for Image Analysis . . . . . . . . . . . . . . . . . . . . . . 17 Paolo Falcoz 2.1 Introduction . . . 17 2.2 The Role of Color . . . 19 2.2.1 Color Spaces . . . 20 2.2.2 HSL and HSV . . . 23 2.2.3 CIE-Lab . . . 25 2.2.4 Color Flattening . . . 26 2.3 Blob Detection . . . 27 2.4 Edge Detection . . . 30 2.5 Simple Shapes . . . 33 2.5.1 Scale and Position Invariants: Procrustes Analysis . . . 33 2.5.2 Shape Alignment: Iterative Closest Point . . . 34 2.5.3 Shape Encoding and Matching: Curvature Space Scale . . . 35 2.6 Combination of simple features . . . 37 2.7 Conclusions . . . 38 References . . . 39
3 Fast and robust Face Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 Marco Anisetti 3.1 Related work . . . 43 3.1.1 Feature-based . . . 44 3.1.1.1 Low level . . . 44 3.1.1.2 Skin-map . . . 45 3.1.1.3 Feature analysis . . . 48 3.1.1.4 Template based . . . 49 3.1.2 Appearance-based . . . 50 3.2 Introduction . . . 52 3.3 Face detection on video stream . . . 52 3.3.1 Efficient object detection . . . 53 3.3.2 Appearance-based face detection . . . 56 3.3.3 Features-based face detection . . . 58 3.3.3.1 Adaptive Skin detection . . . 59 3.3.3.2 Eyes detection and validation . . . 61 3.3.3.3 Face normalization . . . 64 3.4 Experimental results . . . 64 3.4.1 Discussion and Conclusions . . . 68 References . . . 69
4 Automatic 3D Facial Fitting for Tracking in Video Sequence . . . . . . . . . . 73 Valerio Bellandi 4.1 The 3D Face Model . . . 73 4.2 3D morphing basis . . . 75 4.2.1 3D morphing basis for shape and expression . . . 77 4.2.1.1 Shape Unit . . . 77 4.2.1.2 Expression Unit . . . 79 4.3 Appearance basis . . . 81 4.3.0.3 PCA . . . 81 4.3.0.4 Image-based PCA . . . 82 4.4 3D Illumination basis . . . 87 4.5 The General Purposes 3D Tracking Algorithm . . . 90 4.5.1 Feature Location . . . 92 4.6 Model adaptation . . . 94 4.6.1 Feature-based pose estimation . . . 95 4.6.2 Shape and expression inference . . . 98 4.6.3 3D Tracking-based Model refinement . . . 101 4.6.4 Initial refinement . . . 101 4.6.5 Deep refinement . . . 103 4.7 Experimental results . . . 105
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110 Part II Multimedia Knowledge-Based Approaches and Applications 5
Input Devices and Interaction Techniques for VR-Enhanced Medicine . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115 Luigi Gallo and Giuseppe De Pietro 5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116 5.2 Related Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117 5.3 Requirements Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119 5.4 Interaction Metaphors and Techniques . . . . . . . . . . . . . . . . . . . . . . . . 121 5.4.1 Realistic Metaphors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122 5.4.1.1 A Realistic Metaphor: Virtual Hand. . . . . . . . . . 122 5.4.2 Magic Metaphors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122 5.4.2.1 A Magic Metaphor: Virtual Pointer. . . . . . . . . . 123 5.4.3 Pros and Cons of Realistic vs. Magic Interaction Metaphors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123 5.5 The Proposed Input Device: the Wiimote . . . . . . . . . . . . . . . . . . . . . 124 5.5.1 Communication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125 5.5.2 Inputs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125 5.5.3 Outputs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125 5.5.4 Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125 5.6 The Proposed Interaction Techniques . . . . . . . . . . . . . . . . . . . . . . . . . 126 5.6.1 The Manipulation State. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126 5.6.1.1 Pointing. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127 5.6.1.2 Translation and Zooming. . . . . . . . . . . . . . . . . . 127 5.6.1.3 Rotation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128 5.6.2 The Cropping State. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130 5.7 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
6
Bridging Sensing and Decision Making in Ambient Intelligence Environments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135 Elie Raad, Bechara Al Bouna and Richard Chbeir 6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136 6.2 Related Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137 6.3 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139 6.4 Templates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141 6.5 Uncertainty Resolver via Aggregation Functions . . . . . . . . . . . . . . . 142 6.5.1 Average-based Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143 6.5.2 Bayesian Network-Based Function . . . . . . . . . . . . . . . . . . . 143 6.5.3 ”Dempster and Shafer”-Based Function . . . . . . . . . . . . . . . 146 6.5.4 Decision Tree-Based Function . . . . . . . . . . . . . . . . . . . . . . . 148 6.6 Experimentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150 6.6.1 Aggregation Function Accuracy and Time Processing . . . 151 6.6.2 Value Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
6.6.2.1 Test 1: Values higher than 0.5 . . . 153 6.6.2.2 Test 2: Values less than 0.5 . . . 154 6.6.2.3 Test 3: Random Values . . . 154 6.6.2.4 Test 5: 75% of the values are less than 0.5 . . . 156 6.6.2.5 Test 6: Equally distributed values . . . 157 6.6.2.6 Test 7: Distribution change . . . 158 6.6.2.7 Test 8: Influence of the number of returned values 0 and 1 on the aggregated result . . . 158 6.6.2.8 Discussion . . . 159 6.6.3 Template Tuning . . . 160 6.6.3.1 Case 1: using the multimedia function f1 . . . 161 6.6.3.2 Case 2: using the multimedia function f2 . . . 162 6.6.3.3 Uncertainty threshold tuning . . . 162 6.7 Conclusion . . . 163 References . . . 163
7
Ambient Intelligence in Multimedia and Virtual Reality Environments for the rehabilitation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165 Attila Benko and Sik Lanyi Cecilia 7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166 7.2 Using AI by special needs users . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167 7.2.1 Visual Impairment and Partially Sighted People . . . . . . . . 167 7.2.2 Deaf and Hard-of-Hearing People . . . . . . . . . . . . . . . . . . . . 168 7.2.3 Physically Disabled Persons . . . . . . . . . . . . . . . . . . . . . . . . . 168 7.2.4 Mentally Disabled People . . . . . . . . . . . . . . . . . . . . . . . . . . . 169 7.2.5 Smart Home . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169 7.3 A detailed example of using AI in virtual reality for rehabilitation . 170 7.4 Future vision . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174 7.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175 7.6 Acknowledgement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175
8
Artificial Neural Networks for Processing Graphs with Application to Image Understanding: A Survey . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179 Monica Bianchini and Franco Scarselli 8.1 From flat to structural Pattern Recognition . . . . . . . . . . . . . . . . . . . . 179 8.2 Graph processing by neural networks . . . . . . . . . . . . . . . . . . . . . . . . . 183 8.2.1 Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183 8.2.2 A general framework for graph processing . . . . . . . . . . . . . 184 8.2.3 Recursive Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . 186 8.2.4 Graph Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187 8.2.5 Other models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 188 8.3 Graph–based representation of images . . . . . . . . . . . . . . . . . . . . . . . . 189 8.3.1 Image segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 189 8.3.2 Region Adjacency Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . 191 8.3.3 Multi–resolution trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194
8.4 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 197 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 197 Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 201
List of Contributors
Bechara Al Bouna
LE2I Laboratory UMR-CNRS, University of Bourgogne, 21078 Dijon Cedex, France, e-mail: [email protected]

Marco Anisetti
University of Milan, Department of Information Technology, 26013 Crema (CR), Italy, e-mail: [email protected]

Valerio Bellandi
University of Milan, Department of Information Technology, 26013 Crema (CR), Italy, e-mail: [email protected]

Attila Benko
University of Pannonia, Egyetem street 10, H-8200 Veszprem, Hungary, e-mail: [email protected]

Monica Bianchini
Department of Information Engineering, University of Siena, Via Roma 56, 53100 Siena, Italy, e-mail: [email protected]

Sik Lanyi Cecilia
University of Pannonia, Egyetem street 10, H-8200 Veszprem, Hungary, e-mail: [email protected]

Richard Chbeir
LE2I Laboratory UMR-CNRS, University of Bourgogne, 21078 Dijon Cedex, France, e-mail: [email protected]

Giuseppe De Pietro
ICAR-CNR, Via Pietro Castellino 111, 80131 Naples, Italy, e-mail: [email protected]

Paolo Falcoz
University of Milan, Department of Information Technology, 26013 Crema (CR), Italy, e-mail: [email protected]

Luigi Gallo
University of Naples “Parthenope”, Via A. F. Acton 38, 80133 Naples, Italy, e-mail: [email protected]
ICAR-CNR, Via Pietro Castellino 111, 80131 Naples, Italy, e-mail: [email protected]

Gwanggil Jeon
Department of Electronics and Computer Engineering, Hanyang University, 17 Haengdang-dong, Seongdong-gu, Seoul, Korea, e-mail: [email protected]

Jechang Jeong
Department of Electronics and Computer Engineering, Hanyang University, 17 Haengdang-dong, Seongdong-gu, Seoul, Korea, e-mail: [email protected]

Kangjun Lee
Department of Electronics and Computer Engineering, Hanyang University, 17 Haengdang-dong, Seongdong-gu, Seoul, Korea, e-mail: [email protected]

Elie Raad
LE2I Laboratory UMR-CNRS, University of Bourgogne, 21078 Dijon Cedex, France, e-mail: [email protected]

Franco Scarselli
Department of Information Engineering, University of Siena, Via Roma 56, 53100 Siena, Italy, e-mail: [email protected]
Chapter 1
GOP Structure Conversion in Transcoding MPEG-2 to H.264/AVC Kangjun Lee, Gwanggil Jeon and Jechang Jeong
Summary. Currently, H.264/AVC is adopted in many applications. Among existing coding standards, the MPEG-2 main profile, which supports B pictures for bidirectional motion prediction, is widely used in applications such as HDTV and DVDs. Therefore, transcoding the MPEG-2 main profile to the H.264/AVC baseline profile is necessary for universal multimedia access. In this transcoding architecture, which includes group of pictures structure conversion, the proposed algorithms adopt adaptive search range selection through a linearity test of the predictive motion vector for fast motion re-estimation, and use reference region information for a fast mode decision. The proposed algorithms reduce the computational complexity while maintaining the video quality.

Key words: Transcoding, GOP structure, H.264, MPEG-2
1.1 Introduction

Since the H.264/AVC video coding standard [1] provides high coding efficiency, H.264/AVC is widely used in many applications. In particular, the H.264/AVC baseline profile, which can be implemented easily, is widely used in applications such as DMB, IPTV, and storage devices. Among the existing coding standards, the
MPEG-2 main profile [2], which supports B pictures for bi-directional motion prediction, is widely used in HDTV and DVDs. Therefore, transcoding the MPEG-2 main profile to the H.264/AVC baseline profile is required for universal multimedia access [3]. H.264/AVC supports motion estimation with various block sizes for high compression performance, which causes high computational complexity. Therefore, the key to improving transcoding performance is to reduce the motion estimation (ME) and mode decision complexity while maintaining quality. In this transcoding architecture, the group of pictures (GOP) structure should be changed for the H.264/AVC baseline profile, which does not support bi-directional prediction.
Some literature has been published on fast re-estimation for transcoding that includes GOP structure conversion in a cascaded pixel-domain transcoder architecture [4], which enables more flexible content conversion. The linearly changing characteristic of the motion vector (MV) along the time axis is used in [5]; however, a high computational burden is incurred by its many predictors. A simple re-estimation algorithm for transcoding from the MPEG-2 main profile to the H.264/AVC baseline profile is presented by Xin et al. [6]; by adopting a simple prediction architecture, the computational complexity is reduced, but the rate-distortion (RD) performance decreases. For a fast mode decision when transcoding MPEG-2 to H.264/AVC, a top-down splitting procedure is used in [7], and the energy variance of each 8×8 block of the motion-compensated MPEG-2 residual macroblock (MB) is used in [8]. However, these methods do not consider an early skip-mode decision; thus, neither the complexity reduction offered by an early skip-mode decision nor the bit saving of skip-mode selection in homogeneous regions is obtained. The neighboring MB mode information and thresholds are used for a fast mode decision including the skip mode in [9], [10], but the ambiguous mode prediction process incurs additional computational complexity. Also, considering the MB encoding order, the mode information of MBs in the lower-left position relative to the current MB is not yet available; thus, when modes are mixed in a region, incorrect mode prediction occurs.
In this chapter, fast mode decision and motion re-estimation for GOP structure conversion, focused on transcoding the MPEG-2 main profile to the H.264/AVC baseline profile, are presented. In the re-estimation process, adaptive search range selection through an MV linearity test in the temporal direction is applied. For the fast mode decision, an adaptive mode selection method using reference region information is used. The proposed algorithms therefore greatly reduce the computational complexity while maintaining the video quality.
This chapter is organized as follows. Section 1.2 explains the basic re-estimation process and the usefulness of reference region information in transcoding with GOP structure conversion. The proposed algorithms are explained in Section 1.3. Simulation results are presented in Section 1.4, and conclusions are drawn in Section 1.5.
1.2 GOP Structure Conversion

1.2.1 MV Scaling in the Temporal Direction

In transcoding the MPEG-2 main profile to the H.264/AVC baseline profile, the GOP structure of the output bitstream differs from that of the input bitstream. In the MPEG-2 main profile, three types of prediction modes are used in B pictures. An MB coded with forward prediction is predicted from a past reference picture, while backward prediction uses a future reference picture. With interpolative prediction, the prediction MB is the average of the reference pictures in the forward and backward directions. In P pictures, only forward prediction is used. In all three prediction modes, the reference picture must be an I picture or a P picture. In contrast, the H.264/AVC baseline profile supports only forward prediction. To reuse the motion information and reduce complexity, the MPEG-2 MV must be scaled before it can serve as the predictive motion vector (PMV). For a simple implementation in which the H.264/AVC encoder uses only the single previously coded frame as the reference frame, and assuming the MV changes linearly in the temporal direction, the scaled PMV can be calculated with the following equation.
$$
MV_{pmv} =
\begin{cases}
\dfrac{MV_{original}}{N_f}, & \text{in forward prediction}\\[6pt]
\dfrac{-MV_{original}}{N_b}, & \text{in backward prediction}
\end{cases}
\tag{1.1}
$$
In equation (1.1), MVoriginal is the MV from the MPEG-2 decoder, and Nf and Nb are the numbers of B pictures between the current frame and the reference frame in the forward and backward directions, respectively. MVpmv is used as the predictive motion vector. Since the H.264/AVC baseline profile does not support B slices for bi-directional prediction, the backward-predicted motion vector is inverted before being used as the PMV. The PMVs calculated by equation (1.1) are illustrated in Fig. 1.1, where PMVForward is the PMV inferred from forward-direction ME, PMVBackward is the PMV inferred from backward-direction ME, and PMVinterF and PMVinterB are the PMVs inferred from the forward and backward directions of interpolative ME prediction, respectively. The PMV obtained from equation (1.1) can be used as the initial search point for an MB containing a linearly moving object, but it is not as precise for an MB containing a non-linearly moving object [5].
Fig. 1.1 The PMV scaled in temporal direction.
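To make the scaling of equation (1.1) concrete, the sketch below computes the PMV candidates for the three MPEG-2 prediction types. It is an illustration written for this exposition, not the authors' implementation; the function name, argument names, and dictionary keys are ours.

```python
# Illustrative sketch (not the authors' code): scaling MPEG-2 motion vectors
# into predictive motion vectors (PMVs) for the H.264/AVC baseline encoder,
# following equation (1.1). All identifiers are assumed names.

def scale_pmv(prediction, mv_fwd=None, mv_bwd=None, n_f=1, n_b=1):
    """Return the scaled PMV candidates for one MPEG-2 macroblock.

    prediction -- 'forward', 'backward' or 'interpolative'
    mv_fwd     -- (x, y) forward MV decoded from the MPEG-2 stream, if any
    mv_bwd     -- (x, y) backward MV decoded from the MPEG-2 stream, if any
    n_f, n_b   -- number of B pictures between the current frame and the
                  forward / backward reference frame
    """
    if prediction == 'forward':
        return {'PMV_Forward': (mv_fwd[0] / n_f, mv_fwd[1] / n_f)}
    if prediction == 'backward':
        # The backward MV is inverted because the baseline profile can only
        # predict from a past frame.
        return {'PMV_Backward': (-mv_bwd[0] / n_b, -mv_bwd[1] / n_b)}
    # Interpolative prediction carries both MVs, so it yields the two
    # candidates PMV_interF and PMV_interB used by the linearity test
    # of Section 1.3.1.
    return {'PMV_interF': (mv_fwd[0] / n_f, mv_fwd[1] / n_f),
            'PMV_interB': (-mv_bwd[0] / n_b, -mv_bwd[1] / n_b)}
```

For example, for an interpolative MB with n_f = 3 and n_b = 1, scale_pmv('interpolative', mv_fwd=(6, -3), mv_bwd=(2, 1), n_f=3, n_b=1) returns PMV_interF = (2.0, -1.0) and PMV_interB = (-2.0, -1.0).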
1.2.2 Correlation Between the Current MB Mode and the Reference Region

The various MB sizes of H.264/AVC are used to increase the coding efficiency. H.264/AVC supports the following mode set in P slices, where the prefix MODE denotes inter prediction, the number that follows gives the prediction block size, and SKIP denotes the skip mode:

{SKIP, MODE 16×16, MODE 16×8, MODE 8×16, MODE 8×8, MODE 8×4, MODE 4×8, MODE 4×4, INTRA 4, INTRA 16}

In the reference software [12], the encoder complexity is dominated by the best-mode decision process with RD optimization (RDO); therefore, a fast mode decision is the key to reducing the total encoding complexity. Among the various ME modes, SKIP and MODE 16×16 prevail in homogeneous regions [10], while in complex regions a substantial proportion of MBs is coded with small block sizes. Therefore, if we can determine the complexity of a region of the picture, we can reduce the computational complexity by eliminating some modes from the ME process. In the MPEG-2 main profile, an I picture or a P picture is used as the ME reference of the following B or P pictures. As shown in Fig. 1.2, if we can obtain reference-region complexity
information from the I picture or the P picture, and if we can find the correlation between the complexity of the current MB and that reference-region information, we can easily predict the current MB mode.
Fig. 1.2 The reference region indicated by the MPEG-2 MV.
Table 1.1 The coding mode ratio (%) in a homogeneous region (simulated with N=12, M=4, and QP=28).

Sequence     SKIP  16x16  16x8  8x16  8x8  Intra
Akiyo         89     6      2     2    1     0
Bus           36    32     11    11    9     1
Coastguard    43    30      9    10    8     0
Football      42    29      8    10   10     1
Foreman       55    23      7     8    6     1
T. Tennis     67    16      6     5    6     0
We find that the modes coded by H.264/AVC in the reference region are correlated with the current MB mode. When the related MB modes in the reference region are SKIP or MODE 16×16, the distribution of the current MB mode determined by H.264/AVC with RDO is given in Table 1.1. As seen in Table 1.1, the ratio of SKIP and MODE 16×16 is very high, which confirms that the MB modes in the reference region are highly correlated with the current MB mode. Since the reference frame is continuously used as the prediction reference by the following B and P pictures, this characteristic is very useful in transcoding that includes GOP structure conversion.
1.3 Proposed Algorithms

1.3.1 Adaptive Search Range Selection through the MV Linearity Test in the Temporal Direction

When the PMV is obtained with equation (1.1), it causes a prediction error for MBs containing non-linearly moving objects. To address this problem, we determine the linearity of an MB in the temporal direction by comparing the two PMVs of the interpolative prediction. As shown in Fig. 1.3, if the two PMVs of the interpolative prediction are similar, the object in the MB is moving linearly in the temporal direction; if they are not similar, the object is not moving linearly.
Fig. 1.3 The linearity comparison with interpolative prediction. (a) shows the case in which the PMVs of the interpolative prediction are similar; (b) shows the case in which they are quite different.
Fig. 1.4 shows the prediction error as a function of the difference between PMVinterF and PMVinterB: the prediction error grows as the difference between PMVinterF and PMVinterB increases. When the difference is smaller than 2 pixels, the prediction error is quite small. Therefore, by measuring the difference between PMVinterF and PMVinterB, we can choose an adaptive search range for the interpolative prediction. For forward prediction and backward prediction, where bi-directional prediction is not available and linearity cannot be determined this way, constant search ranges are used: the average prediction error is smaller than 3 pixels for forward prediction and smaller than 6 pixels for backward prediction. The adaptive search range is therefore calculated with the following equation (1.2).
Fig. 1.4 The PMV error in the linearity test.
$$
\begin{aligned}
&\text{Forward prediction:}\quad \mathit{search\ range} = 3\\
&\text{Backward prediction:}\quad \mathit{search\ range} = 6\\
&\text{Interpolative prediction:}\quad \mathit{difference} = \operatorname{abs}(PMV_{interF} - PMV_{interB}),\\
&\qquad \mathit{search\ range} =
\begin{cases}
2, & \mathit{difference} = 0, 1\\
3, & \mathit{difference} = 2, 3\\
5, & \mathit{difference} = 4, 5\\
7, & \text{otherwise}
\end{cases}
\end{aligned}
\tag{1.2}
$$
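A possible implementation of the selection rule of equation (1.2) is sketched below. The chapter does not specify how the two-dimensional PMV difference is reduced to a single number; taking the larger of the horizontal and vertical component differences is our assumption, as are all names.

```python
# Illustrative sketch of the adaptive search range selection of equation (1.2).
# How the PMV difference is collapsed to one value is our assumption.

def search_range(prediction, pmv_interF=None, pmv_interB=None):
    """Return the motion estimation search range (in pixels) for one MB."""
    if prediction == 'forward':
        return 3
    if prediction == 'backward':
        return 6
    # Interpolative prediction: compare the two scaled PMVs to judge whether
    # the object moves linearly in the temporal direction.
    diff = max(abs(pmv_interF[0] - pmv_interB[0]),
               abs(pmv_interF[1] - pmv_interB[1]))
    if diff <= 1:
        return 2
    if diff <= 3:
        return 3
    if diff <= 5:
        return 5
    return 7
```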
1.3.2 Adaptive Mode Decision Method Based on Region Information

As previously mentioned, the current MB mode is highly correlated with the MB modes in the reference region indicated by the MPEG-2 MV. To exploit this characteristic, we classify the reference region into five types. Figure 1.5 shows the various reference regions composed of different reference modes. The all skip type means that all MBs in the reference region are coded in SKIP mode; the skip 16 type is composed of a combination of SKIP and MODE 16×16; in the all 16 type, the reference region is composed only of MODE 16×16. The above mode 8×8 type means that the block size chosen in the mode
decision of every MB in the reference region is larger than MODE 8×8, while in the complex type an MB with a block size smaller than MODE 8×8 exists in the reference region.
Fig. 1.5 The reference regions in five types.
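The classification into the five region types of Fig. 1.5 can be expressed compactly; the sketch below is one possible way to label a reference region from the H.264/AVC modes of the MBs it covers. It is our illustration, not the authors' implementation; the handling of the 8×8/intra boundary and all identifiers are assumptions.

```python
# Illustrative sketch (ours): labelling a reference region from the coding
# modes of the macroblocks it covers, using the five types of Fig. 1.5.

SMALL_MODES = {'MODE_8x4', 'MODE_4x8', 'MODE_4x4', 'INTRA_4', 'INTRA_16'}

def classify_reference_region(mb_modes):
    """mb_modes -- iterable of mode names for the MBs inside the region."""
    modes = set(mb_modes)
    if modes == {'SKIP'}:
        return 'all skip'
    if modes == {'MODE_16x16'}:
        return 'all 16'
    if modes <= {'SKIP', 'MODE_16x16'}:
        return 'skip 16'
    if modes & SMALL_MODES:
        # At least one MB smaller than 8x8 (intra treated the same way here,
        # which is our assumption).
        return 'complex'
    return 'above mode 8x8'
```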
Table 1.2 The coding mode ratio (%) in the different reference regions.

                 all skip       skip 16        all 16
Sequence        SKIP  16x16    SKIP  16x16    SKIP  16x16
Akiyo            97     2       75    13       29    33
Bus              60    25       35    37       17    44
Coastguard       56    26       27    42        7    44
Football         47    32       38    37       26    37
Foreman          71    17       52    28       23    36
Table Tennis     72    13       47    29       27    39
In particular, in homogeneous reference regions, where the MB coding modes consist of SKIP or MODE 16×16, the distribution of the current MB mode depends on the composition ratio between SKIP and MODE 16×16. Table 1.2 shows the ratio of SKIP and MODE 16×16 among the current MB coding modes for the different homogeneous reference regions. When the reference region is all skip, the ratio of current MBs coded in SKIP is high. In all 16, the proportion of current MBs coded in SKIP or MODE 16×16 is also high, but the proportion of MODE 16×16 is higher than the
proportion of SKIP. In skip 16, although the proportion of SKIP is smaller than in the all skip region, most MBs are still coded in SKIP or MODE 16×16. Therefore, we can adapt the mode decision method to the different reference regions. Additionally, for a more precise mode prediction, three thresholds are used; by applying different threshold values in the different reference regions, a more precise mode prediction is performed. The sum of absolute differences (SAD) is used to calculate the thresholds:

AverSKIP = the average SAD, at the predicted skip position, of the SKIP-mode MBs in the reference region.
Aver16 = the average minimum SAD of the MODE 16×16 MBs in the reference region.
AverSKIP_FRAME = the average SAD of the SKIP-mode MBs in the previous frame.

The three thresholds are compared with the following two SAD values:

SADSKIP = the SAD at the predicted SKIP position.
SAD16 = the minimum SAD within the search range.

When the reference region is above mode 8×8, the prediction mode is determined from the variance of each 8×8 block, computed as the sum of the absolute values of the dequantized DCT coefficients of the motion-compensated MPEG-2 residual MB [8]. In such monotonous regions, the prediction among MODE 16×16, MODE 16×8, MODE 8×16, and MODE 8×8 obtained by comparing the 8×8 block variances is quite precise.
For the mode decision in the B pictures of MPEG-2, in the all skip region the ratio of SKIP and MODE 16×16 is very high compared to the other reference regions, so we consider only SKIP and MODE 16×16 and decide between them with the AverSKIP threshold. Since the SKIP ratio is higher than the MODE 16×16 ratio, AverSKIP is multiplied by the constant 1.2 when calculating the threshold. In the skip 16 region, as seen in Table 1.2, the ratio between SKIP and MODE 16×16 varies with the sequence, which makes the decision between SKIP and MODE 16×16 difficult; we therefore use two thresholds. If SADSKIP and SAD16 are smaller than AverSKIP and Aver16, respectively, SKIP is selected. Else, if only SAD16 is smaller than Aver16, the RD costs of SKIP and MODE 16×16 are compared. Otherwise, SAD16 is larger than Aver16 and the energy variance of each 8×8 block is compared for the mode decision. When the reference region is all 16, the ratio of MBs coded in MODE 16×16, as seen in Table 1.2, is higher than the SKIP ratio. Thus, if SADSKIP is smaller than AverSKIP_FRAME and SAD16 is smaller than Aver16, SKIP is selected
as the prediction mode. Else, if SAD16 is smaller than Aver16, MODE 16×16 is selected; otherwise, the energy variance of each 8×8 block is compared. Since P pictures serve as the mode prediction reference for the following B and P pictures, a more precise mode decision is required for them. Therefore, when the reference region in a P picture is all skip or skip 16, the mode decision is performed by RDO with SKIP, MODE 16×16, MODE 16×8, MODE 8×16, and MODE 8×8 enabled; in the other regions, the mode decision is performed by RDO with all modes enabled. In the complex region, the mode decision is likewise performed by RDO with all modes enabled: although comparing the rate-distortion costs adds considerable computational complexity, it reduces the mode prediction error in the complex region. The adaptive mode decision process is executed in the following sequence.
# The mode decision in P pictures of the MPEG-2 stream #
The PMV for the ME is adjusted with equation (1.1).
The search range is determined by equation (1.2).
If (reference region = all skip or skip 16)
    code by RDO, enabling SKIP, MODE 16×16, MODE 16×8, MODE 8×16, MODE 8×8.
Else
    code the current MB by RDO, enabling all prediction modes.
# The mode decision in B pictures of the MPEG-2 stream #
The PMV for the ME is adjusted with equation (1.1).
The search range is determined by equation (1.2).
If (reference region = all skip)
    if ((AverSKIP × 1.2) ≥ SADSKIP)
        SKIP is selected.
    else
        MODE 16×16 is selected.
Else if (reference region = skip 16)
    if (AverSKIP ≥ SADSKIP && Aver16 ≥ SAD16)
        SKIP is selected.
    else if (Aver16 ≥ SAD16)
        decide between SKIP and MODE 16×16 by RDO.
    else
        the energy variance is compared.
Else if (reference region = all 16)
    if (AverSKIP_FRAME ≥ SADSKIP && Aver16 ≥ SAD16)
        SKIP is selected.
    else if (Aver16 ≥ SAD16)
        MODE 16×16 is selected.
    else
        the energy variance is compared.
Else if (reference region = above mode 8×8)
    the energy variance is compared.
Else
    code the current MB by RDO, enabling all prediction modes.
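All the thresholds above are built from block SADs. As a small illustration (our code, not the reference software), the sketch below shows the SAD measure and the early test used for MBs whose reference region is of the all skip type, where AverSKIP is biased by the constant 1.2.

```python
import numpy as np

# Illustrative sketch (ours): the SAD measure behind the thresholds of
# Section 1.3.2 and the early decision used in "all skip" reference regions.

def sad(block_a, block_b):
    """Sum of absolute differences between two equally sized pixel blocks."""
    return int(np.abs(block_a.astype(np.int32) - block_b.astype(np.int32)).sum())

def decide_all_skip(sad_skip, aver_skip, margin=1.2):
    """sad_skip  -- SAD of the current MB at its predicted SKIP position
    aver_skip -- the AverSKIP threshold of the reference region
    margin    -- the constant 1.2 that biases the test toward SKIP"""
    return 'SKIP' if sad_skip <= aver_skip * margin else 'MODE_16x16'
```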
1.4 Simulation Results

The transcoder is implemented with TM5 [11] for the MPEG-2 main profile and JM 8.6 [12] for the H.264/AVC baseline profile. For the input bitstreams, CIF-size (352 × 288) sequences are encoded at 4 Mbps and 30 frames/s with GOP structure (N = 12, M = 4). In the H.264/AVC baseline encoder, the previously coded frame is used as the reference frame; therefore, the GOP structure of the output bitstreams is N = ∞, M = 1. All tests are executed on an Intel Pentium Core 2 at 1.86 GHz with 1 GB of RAM. For the RD performance comparison in Figs. 1.6-1.9, the reference method re-encodes the sequence by comparing RD costs with all modes enabled and a search range of 16. In the adaptive search range selection (ASRS) method, the search range is adjusted by the proposed MV linearity test. In the adaptive mode decision method (AMDM), the proposed mode decision using reference region information is exploited.
Figures 1.6-1.9 show the RD performance for the Akiyo, Foreman, Bus, and Football sequences. In the Akiyo sequence, the RD performance is almost identical for all methods, but compared with the reference method the computational cost is reduced by 19% with ASRS and by 82% with ASRS+AMDM. In the Foreman sequence, the average PSNR drop of ASRS with respect to the reference method is less than 0.05 dB, and that of ASRS+AMDM is less than 0.1 dB; computational cost savings of 14% and 71% over the reference method are observed for ASRS and ASRS+AMDM, respectively. In the Bus sequence, the RD performance of ASRS decreases by about 0.03 dB and that of ASRS+AMDM by about 0.1 dB compared with the reference method, while the computational complexity is reduced by 12% with ASRS and by 70% with ASRS+AMDM. In the Football sequence, the RD performance of the ASRS
Fig. 1.6 The RD performance in the Akiyo sequence.
Fig. 1.7 The RD performance in the Foreman Sequence.
method compared with the reference method is almost the same. At bit-rates below 2000 kbps, the average PSNR drop of the ASRS+AMDM compared with the reference method is less than 0.1 dB. At bit-rates above 2000 kbps, the PSNR drop is
Fig. 1.8 The RD performance in the Bus Sequence.
Fig. 1.9 The RD performance in the Football Sequence.
less than 0.15 dB. Compared with the reference method, the computational complexity is reduced by 11% with the ASRS method and by 68% with the ASRS+AMDM method. As shown in Table 1.3, the computational savings for the Akiyo sequence, which contains many homogeneous regions, are remarkable.
Table 1.3 The encoding time (s) comparison with different transcoding methods.

Sequence    Reference   ASRS         ASRS+AMDM
Akiyo          592      476 (-19%)   101 (-82%)
Bus            687      606 (-12%)   202 (-70%)
Football       645      568 (-11%)   194 (-68%)
Foreman        611      516 (-14%)   172 (-71%)
1.5 Conclusion

The proposed algorithms are efficient for transcoding architectures that include GOP structure conversion, such as transcoding the MPEG-2 main profile to the H.264/AVC baseline profile. While maintaining the video quality, the computational complexity is greatly reduced by the adaptive search range selection based on the MV linearity test and by the adaptive mode decision using reference region information.
References

1. Draft ITU-T Recommendation and Final Draft International Standard of Joint Video Specification. ITU-T Rec. H.264 — ISO/IEC 14496-10 AVC (2003)
2. Information Technology — Generic Coding of Moving Pictures and Associated Audio Information: Video. ISO/IEC 13818-2 (1995)
3. Mohan, R., Smith, J.R., Li, C.-S.: Adapting multimedia Internet content for universal access. IEEE Trans. Multimedia, 1(1), 104–114 (1999)
4. Sun, H., Kwok, W., Zdepski, J.W.: Architectures for MPEG compressed bitstream scaling. IEEE Trans. Circuits Syst. Video Technol., 6(2), 191–199 (1996)
5. Shanableh, T., Ghanbari, M.: The importance of the bi-directionally predicted pictures in video streaming. IEEE Trans. Circuits Syst. Video Technol., 11(3), 402–414 (2001)
6. Xin, J., Vetro, A., Sekiguchi, S., Sugimoto, K.: Motion and mode mapping for MPEG-2 to H.264/AVC transcoding. Proc. of IEEE Int. Conf. Multimedia and Expo, pp. 313–316 (2006)
7. Zhou, Z., Sun, S., Lei, S., Sun, M.T.: Motion information and coding mode reuse for MPEG-2 to H.264 transcoding. Proc. of IEEE Int. Symp. Circuits Syst., 2, 1230–1233 (2005)
8. Chen, G., Zhang, Y., Lin, S., Dai, F.: Efficient block size selection for MPEG-2 to H.264 transcoding. Proc. of ACM Int. Conf. Multimedia, 300–303 (2004)
9. Lu, X., Tourapis, A.M., Yin, P., Boyce, J.: Fast mode decision and motion estimation for H.264 with a focus on MPEG-2/H.264 transcoding. Proc. of IEEE Int. Symp. Circuits Syst., 2, 1246–1249 (2005)
10. Grecos, C., Yang, M.Y.: Fast inter mode prediction for P slices in the H.264 video coding standard. IEEE Trans. Broadcasting, 51(2), 256–263 (2005)
11. Test Model 5, ISO/IEC JTC1/SC29/WG11, N0400 (1993)
12. H.264/AVC Reference Software JM 8.6: http://bs.hhi.de/~suehring/tml/
Chapter 2
Simple Low Level Features for Image Analysis Paolo Falcoz
Summary. As human beings, we perceive the world around us mainly through our eyes, and give what we see the status of “reality”; as such, we have historically tried to create ways of recording this reality so we could augment or extend our memory. From early attempts in photography, like the image produced in 1826 by the French inventor Nicéphore Niépce (Figure 2.1), to the latest high-definition camcorders, the number of recorded pieces of reality has increased exponentially, posing the problem of managing all that information. Most of the raw video material produced today has lost its memory augmentation function, as it will hardly ever be viewed by any human; pervasive CCTVs are an example. They generate an enormous amount of data each day, but there is not enough “human processing power” to view it. Therefore the need for effective automatic image analysis tools is great, and a lot of effort has been put into it, both from academia and industry. In this chapter, a review of some of the most important image analysis tools is presented.

Key words: Color Space, Blob Detection, Shape Detection, Procrustes Analysis
2.1 Introduction

As human beings, we perceive the world around us mainly through our eyes, and give what we see the status of “reality”; as such, we have historically tried to create ways of recording this reality so we could augment or extend our memory. From early attempts in photography, like the image produced in 1826 by the French inventor Nicéphore Niépce (Figure 2.1), to the latest high-definition camcorders, the number of recorded pieces of reality has increased exponentially, posing the problem of managing all that information. Most of the raw video material produced today has lost
its memory augmentation function, as it will hardly ever be viewed by any human; pervasive CCTVs are an example. They generate an enormous amount of data each day, but there is not enough “human processing power” to view it. Therefore the need for effective automatic image analysis tools is great, and a lot of effort has been put into it, both from academia and industry. Results from different research groups are impressive, and the DARPA Grand and Urban Challenge [9] can be considered the showcase for the state of the art in image processing. It may be useful to recall that the DARPA Grand Challenge is a prize competition for driverless cars, sponsored by DARPA with the goal of developing the technologies needed to create the first fully autonomous ground vehicle. The third event, the DARPA Urban Challenge, which took place on November 3, 2007, further advanced the vehicle requirements to include autonomous operation in a mock urban environment. Robotics also offers many meaningful examples of significant achievements in image processing, from the well-known humanoid robot Asimo [14] to the less friendly machine-gun-equipped sentry robot developed in South Korea by Korea University and Samsung [33]. More and more examples of complex image analysis tools embedded in consumer electronic equipment are available today; the face detection feature built into some Canon cameras and Sony camcorders is an example.
Image processing algorithms can be extremely complex, but there are some basic operations and features that – whatever the complexity – are almost always considered. Among them we can mention:
• color analysis;
• edge extraction;
• shape matching;
• texture analysis.
For each of those features there exist many different algorithms with different goals and complexity, but taken alone most of them perform well only under specific conditions, and are lacking in the general case. Note that by “specific conditions” we include the need of having a dedicated training database, so that after training the algorithm works well only for the class of object for which it has been trained. A simple solution is to combine two or more different features together, so that the strengths of a feature can overcome the weaknesses of another, and vice versa. A similar problem has been faced by the MPEG-7 standard [21], which decided to make use of shape, region, and color descriptors altogether [2]. In the following sections we will discuss the meaning of “color” (section 2.2) and “color space” (sections 2.2.2 and 2.2.3); then we will use color to extract blobs (section 2.3). Different edge detectors will be presented in section 2.4, while in 1
1 The Defense Advanced Research Projects Agency (DARPA) is an agency of the United States Department of Defense responsible for the development of new technology for use by the military. DARPA has been responsible for funding the development of many technologies which have had a major impact on the world, including ARPANET, the ancestor of the modern Internet.
2 Canon PowerShot S5 IS
3 Sony HDR-CX12 HD AVCHD
Fig. 2.1 Nicéphore Niépce’s earliest surviving photograph, c. 1826 (View from the window of Le Gras). This image required an eight-hour exposure, which resulted in sunlight being visible on both sides of the buildings.
sections 2.5.1 and 2.5.2 we will introduce Procrustes Analysis and the Iterative Closest Point algorithm for shape registration (alignment). Section 2.5.3 will deal with Curvature Scale Space Descriptors (CSSD), an effective way of describing shapes using scale, position, and rotation invariants; CSSDs can be encoded for fast shape matching [28]. Section 2.6 presents some simple ideas for combining different features so that valuable knowledge can be extracted.
2.2 The Role of Color

From an anatomical point of view, all human interaction with “color” is mediated by the retina, the light-sensitive layer at the back of the eye that covers about 65 percent of its interior surface. Photosensitive cells called rods and cones in the retina convert incident light energy into signals that are carried to the brain by the optic nerve. Rods are responsible for night vision, motion detection, and peripheral vision, while cones are responsible for both color vision and the highest visual acuity [12]. In the middle of the retina is a small dimple called the fovea, or fovea centralis. It is the center of the eye’s sharpest vision and the location of most color perception. In fact, while cones are concentrated in the fovea, rods are absent there but dense elsewhere. Measured density curves for the rods and cones on the retina show an enormous density of cones in the fovea (Figure 2.2 (a)).
Considering for humans a global field of view of about 180° and a color field of view of about 15°, and translating the ratio between them onto a 640 × 480 image, we find that actual color vision happens only within a 53 × 40 region (indeed, 640 × 15/180 ≈ 53 and 480 × 15/180 = 40). In Figure 2.2 (b) the inner rectangle is where color vision happens. In fact the rectangle should be blurred, because a small number of cones is present also at larger separation angles. Note that we used a rectangle only for simplicity; a circle or an ellipse can be used as well.
From a perceptual point of view, “color” is the visual perceptual property corresponding in humans to the categories called red, yellow, blue, black, etc. Color categories and the physical perception of color are obviously related to objects, materials, light sources, etc., and to their physical properties of light absorption, reflection, and emission. Color is therefore a very complex feature whose description depends on light characteristics, environmental conditions, and sensor quality; the same “physical” red color with a wavelength of 780 nm has different descriptions when perceived by a normal person, by a color-blind person, or by a webcam sensor (Figure 2.3).
Despite its complexity and drawbacks, color is still a very important feature, used in many image processing tasks (e.g., skin detection) with some clear advantages:
• it is very easy to compute;
• it is independent of image size and orientation.
However, in order to formalize the concept of color, we need to introduce the concept of color space.
2.2.1 Color Spaces

A color model is an abstract mathematical model describing the way colors can be represented as tuples of numbers, typically as three or four values or color components. When this model is associated with a precise description of how the components are to be interpreted (viewing conditions, etc.), the resulting set of colors is called a color space. Adding a certain mapping function between the color model and a certain reference color space results in a definite “footprint” within the reference color space. This “footprint” is known as a gamut and, in combination with the color model, defines a new color space. For example, Adobe RGB and sRGB are two different color spaces, both based on the RGB model.
The RGB color model is an additive color model in which red, green, and blue light are added together in various ways to reproduce a broad array of colors. It is additive in the sense that the three light beams are added together, and their light spectra add, wavelength for wavelength, to make the final color’s spectrum. Zero intensity for each component gives the darkest color (no light, considered the black),
4 Image taken from http://en.wikipedia.org/wiki/Color_blindness
Fig. 2.2 Original image (a), and proportion of the image actually seen in full color at any instant (b).
Fig. 2.3 An 1895 illustration of normal vision and various kinds of color blindness.
and full intensity of each gives a white. The name of the model comes from the initials of the three additive primary colors: red, green, and blue. Oddly enough, the first known permanent color photo was taken by James Clerk Maxwell using the RGB color model developed by Thomas Young, Hermann Helmholtz and Maxwell himself. Figure 2.4 shows the photo, taken in 1861. Those first experiments in color photography involved the process of three separate color-filtered takes [13]. To reproduce the color photograph, three matching projections onto a screen in a dark room were necessary. Note that subtractive color models exist too; they work by partially or entirely masking certain colors on a typically white background (that is, absorbing particular wavelengths of light). Such models are called subtractive because colors “subtract”
5 Image taken from Wikipedia, http://en.wikipedia.org/wiki/RGB_color_model
Fig. 2.4 The first permanent color photograph, taken by J. C. Maxwell in 1861 using three red, green, and violet-blue filters.
brightness from white. In the case of CMY those colors are cyan, magenta, and yellow. There are many different color spaces; however, when dealing with human perception of color, only a few should be considered: those defined to be perceptually uniform. A perceptually-uniform color space is a color space in which any two colors that are perceived as “close” are “close” also in their numerical representation, and vice versa. For example, CIE-Lab is perceptually uniform, while Adobe RGB and sRGB are not. In the next two subsections we will focus on three different color spaces; the first two – HSL and HSV – represent an attempt to derive a more perceptually uniform color space from RGB, while the third – CIE-Lab – was conceived to be perceptually uniform.
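To close the additive-versus-subtractive discussion above with a concrete example, the sketch below shows the idealized complement relation between RGB and CMY. It is a textbook simplification added for illustration, not part of this chapter: real inks are not ideal, which is why printing systems add a black channel (CMYK) and device profiles.

```python
# Idealized RGB <-> CMY complement (a simplification; real inks deviate).

def rgb_to_cmy(r, g, b):
    """r, g, b in [0, 1]: the ink amounts that 'subtract' the complementary
    light from a white background."""
    return 1.0 - r, 1.0 - g, 1.0 - b

def cmy_to_rgb(c, m, y):
    return 1.0 - c, 1.0 - m, 1.0 - y
```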
2.2.2 HSL and HSV

HSL and HSV are two related representations of points in an RGB color space, which attempt to describe perceptual color relationships more accurately than RGB, while remaining computationally simple. HSL stands for hue, saturation, lightness, while HSV stands for hue, saturation, value.
Both HSL and HSV describe colors as points in a cylinder (Figure 2.5) whose central axis ranges from black at the bottom to white at the top with neutral colors between them, where angle around the axis corresponds to “hue”, distance from the axis corresponds to “saturation”, and distance along the axis corresponds to “lightness”, “value”, or “brightness”.
Fig. 2.5 Graphical representation of HSV cylinder
The two representations are similar in purpose, but differ somewhat in approach. Both are mathematically cylindrical, but while HSV (hue, saturation, value) can be thought of conceptually as an inverted cone of colors (with a black point at the bottom, and fully-saturated colors around a circle at the top), HSL conceptually represents a double-cone or sphere (with white at the top, black at the bottom, and the fully-saturated colors around the edge of a horizontal cross-section with middle gray at its center). Note that while "hue" in HSL and HSV refers to the same attribute, their definitions of "saturation" differ dramatically (Figure 2.6). Because HSL and HSV are simple transformations of RGB, the color defined by a (h, s, l) or (h, s, v) tuple depends on the particular color of the red, green, and blue "primaries" used. Note that in practice those primaries are strictly related to the technology used to generate them; the actual "blue" generated by the blue electron gun of a cathode ray device is different from the blue generated by the blue LEDs of an LED device, and from the blue sensed by the blue detectors of a CCD camera. Each unique RGB device therefore has unique HSL and HSV spaces to accompany it. An (h, s, l) or (h, s, v) tuple can however become definite when it is tied to a particular RGB color space, such as sRGB.
Fig. 2.6 Comparison of the HSL and HSV color spaces (image from http://en.wikipedia.org/wiki/HSL_and_HSV).
Both models were first formally described in 1978 by Alvy Ray Smith [30], though the concept of describing colors by these three dimensions, or equivalents such as hue, chroma, and tint, was introduced much earlier [27].
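To make the relationship between RGB and HSV concrete, the conversion of a single RGB triple (assumed to lie in [0, 1]) can be sketched in Python as follows; this is only an illustrative implementation of the standard hexcone formulation, and the function name is ours.

def rgb_to_hsv(r, g, b):
    # Convert an RGB triple in [0, 1] to (hue in degrees, saturation, value).
    mx, mn = max(r, g, b), min(r, g, b)
    v = mx                              # value: the brightest component
    delta = mx - mn
    s = 0.0 if mx == 0 else delta / mx  # saturation relative to value
    if delta == 0:                      # achromatic pixel: hue is undefined, use 0
        h = 0.0
    elif mx == r:
        h = 60 * (((g - b) / delta) % 6)
    elif mx == g:
        h = 60 * (((b - r) / delta) + 2)
    else:                               # mx == b
        h = 60 * (((r - g) / delta) + 4)
    return h, s, v

For example, rgb_to_hsv(0.2, 0.4, 0.9) gives a hue of about 223 degrees with saturation close to 0.78; the same triple converted to HSL would yield the same hue but a different saturation value, which is precisely the difference between the two representations discussed above.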
2.2.3 CIE-Lab

CIELAB is the second of two systems adopted by the CIE (Commission Internationale de l'Eclairage) in 1976 as models that better showed uniform color spacing in their values. CIELAB is an opponent color system based on the earlier (1942) system of Richard Hunter [17][18] called L, a, b. Color opposition correlates with discoveries made in the mid-1960s that somewhere between the optical nerve and the brain, retinal color stimuli are translated into distinctions between light and dark, red and green, and blue and yellow. CIELAB indicates these values with three axes: L*, a*, and b* (Figure 2.7); the full nomenclature is 1976 CIE L*a*b* space. The central vertical axis represents lightness (signified as L*), whose values run from 0 (black) to 100 (white). This scale is closely related to Munsell's [25][26] value axis, except that the value of each step is much greater. This is the same lightness valuation used in CIELUV. The color axes are based on the fact that a color cannot be both red and green, or both blue and yellow, because these colors oppose each other. On each axis the values run from positive to negative. On the a-a' axis, positive values indicate amounts of red while negative values indicate amounts of green. On the b-b' axis, yellow is positive and blue is negative. For both axes, zero is neutral gray. Therefore, values are only needed for two color axes and for the lightness or grayscale axis (L*), which is separate (unlike in RGB, CMY or XYZ, where lightness depends on relative amounts of the three color channels).
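For reference, the standard CIE 1976 conversion from XYZ tristimulus values to L*a*b*, relative to a reference white (X_n, Y_n, Z_n), is usually written as

L^* = 116\, f(Y/Y_n) - 16, \qquad a^* = 500\,[f(X/X_n) - f(Y/Y_n)], \qquad b^* = 200\,[f(Y/Y_n) - f(Z/Z_n)],

f(t) = t^{1/3} \ \text{if } t > (6/29)^3, \qquad f(t) = \tfrac{1}{3}(29/6)^2\, t + \tfrac{4}{29} \ \text{otherwise.}

The cube-root compression is what gives the space its approximate perceptual uniformity; the linear branch only handles very dark values.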
Fig. 2.7 Graphical representation of CIE-Lab color space
2.2.4 Color Flattening

Now that we know what color is from an abstract point of view, we are ready to work with actual colors from digital images and videos. The problem is that these are mostly shot with low cost, low quality equipment, meaning non-uniform colors and evident noise. One method to cope with this is to flatten colors: a very common filter used for this purpose is the blur filter. The problem with this approach is that not only noise but also edges are flattened, causing the loss of potentially important information. A better solution is to perform several steps of bilateral filtering [34] (Figure 2.8). The idea of this approach is to combine a low-pass (domain) filter with a range filter:

h(x) = k^{-1}(x) \int_{-\infty}^{+\infty} \int_{-\infty}^{+\infty} f(\xi)\, c(\xi, x)\, s(f(\xi), f(x))\, d\xi

where

k(x) = \int_{-\infty}^{+\infty} \int_{-\infty}^{+\infty} c(\xi, x)\, s(f(\xi), f(x))\, d\xi,

c(\xi, x) measures the geometric closeness between the neighborhood center x and a nearby point \xi, s(f(\xi), f(x)) measures the photometric similarity between the pixel at the neighborhood center x and that of a nearby point \xi, and f(\cdot) represents the image function. The low-pass filter is defined by c(\xi, x), while the range filter is defined by s(f(\xi), f(x)). Since this technique combines two different filters, it is called bilateral filtering. The actual implementation of the low-pass and range filters can be based on simple Gaussian filtering, where both the closeness function c(\xi, x) and the similarity function s(f(\xi), f(x)) are Gaussian functions of the Euclidean distance between their arguments. Closeness then becomes

c(\xi, x) = e^{-\frac{1}{2} \left( \frac{d(\xi, x)}{\sigma_d} \right)^2}, \qquad d(\xi, x) = d(\xi - x) = |\xi - x|,

while similarity becomes

s(\xi, x) = e^{-\frac{1}{2} \left( \frac{\delta(f(\xi), f(x))}{\sigma_r} \right)^2}, \qquad \delta(\phi, f) = \delta(\phi - f) = |\phi - f|.

The meaning of bilateral filtering is to replace the pixel value at x with an average of similar (photometric similarity) and nearby (geometric closeness) pixel values. In smooth regions, pixel values in a small neighborhood are similar to each other, and the normalized similarity function is close to one. As a consequence, the bilateral filter acts essentially as a standard domain filter, and averages away the small, weakly correlated differences between pixel values caused by noise. Consider now a sharp boundary between a dark and a bright region, and suppose that the bilateral filter is centered on a pixel on the bright side of the boundary: the similarity function assumes values close to one for pixels on the same side, and close to zero for pixels on the dark side. The normalization term k(x) ensures that the weights for all the pixels add up to one. As a result, the filter replaces the bright pixel at the center by an average of the bright pixels in its vicinity, and essentially ignores the dark pixels. Conversely, when the filter is centered on a dark pixel, the bright pixels are ignored instead. Thus, good filtering behavior is achieved at the boundaries, thanks to the domain component of the filter, and crisp edges are preserved at the same time, thanks to the range component.
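As an illustration, a brute-force gray-scale implementation of the filter described above might look as follows; it is only a sketch (the window radius, sigma_d and sigma_r are free parameters, no attention is paid to speed, and color images would simply evaluate the range term on a suitable color distance).

import numpy as np

def bilateral_filter(img, radius=3, sigma_d=2.0, sigma_r=0.1):
    # Naive bilateral filter for a 2D float image with values in [0, 1].
    h, w = img.shape
    out = np.zeros_like(img)
    ys, xs = np.mgrid[-radius:radius + 1, -radius:radius + 1]
    closeness = np.exp(-0.5 * (ys**2 + xs**2) / sigma_d**2)   # c(xi, x): geometric term
    padded = np.pad(img, radius, mode='edge')
    for i in range(h):
        for j in range(w):
            window = padded[i:i + 2 * radius + 1, j:j + 2 * radius + 1]
            similarity = np.exp(-0.5 * ((window - img[i, j]) / sigma_r)**2)  # s(f(xi), f(x))
            weights = closeness * similarity
            out[i, j] = np.sum(weights * window) / np.sum(weights)  # k(x) normalization
    return out

Running this a few times in sequence reproduces the "several steps of bilateral filtering" used to obtain Figure 2.8 (b).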
2.3 Blob Detection

Blob detection and extraction proves to be a useful tool in many areas; one main application is to provide complementary information about regions, which is not obtained from edge detectors or corner detectors. In early work in the area, blob detection was used to obtain regions of interest for further processing.
Fig. 2.8 Original (a) and corresponding flattened (b) image (4 steps).
These regions could signal the presence of objects or parts of objects in the image domain, with application to object recognition and/or object tracking. In other domains, such as histogram analysis, blob descriptors can also be used for peak detection with application to segmentation. Another common use of blob descriptors is as main primitives for texture analysis and texture recognition.
Fig. 2.9 Original image (a), mask (white) of lilac blob (b), mask after morphological closing (c), mask after noise removal (d).
In more recent work, blob descriptors have found increasingly popular use as interest points for wide baseline stereo matching [22] and to signal the presence of informative image features for appearance-based object recognition based on local image statistics. Simple blob extraction and refinement based on color ranges can be achieved using the following idea:
1. given the color input image I, create a binary matrix M with the same width and height as I, and set all its elements to 0;
2. scan the input image element by element and check if the value of each color plane falls within the specified range. If yes, then set the corresponding mask element to 1 (Figure 2.9 (b));
3. apply a morphological "closing" (dilation followed by erosion) to M in order to fill holes and to smooth blobs (Figure 2.9 (c));
4. remove isolated groups of pixels smaller than a given threshold (Figure 2.9 (d));
5. scan M and label all 8-connected blobs. Labeling can be done using the technique outlined in [10]. Each blob's mass is then calculated, along with color statistics.
Note that the previous procedure can be applied to blob detection based on characteristics other than color; the only changing part is the one that assigns an element to the blob or to the background (the non-blob area) [15].
Since there can be many blobs of the same color (with the same characteristic), a selection criterion can be used in order to take only the n best ones; for example, if "best" means "biggest", then only the biggest n blobs are considered.
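The five steps above can be written compactly with NumPy and SciPy, as in the following sketch; the color range, the structuring element and the area threshold are arbitrary illustrative choices, not values prescribed by the text.

import numpy as np
from scipy import ndimage

def extract_blobs(img, low, high, min_area=50, n_best=3):
    # img: H x W x 3 array; low/high: per-channel bounds of the color range.
    # Steps 1-2: binary mask of pixels whose channels all fall within [low, high].
    mask = np.all((img >= low) & (img <= high), axis=2)
    # Step 3: morphological closing (dilation followed by erosion) fills holes.
    mask = ndimage.binary_closing(mask, structure=np.ones((5, 5)))
    # Step 5 is done first here so that step 4 can drop small components:
    labels, n = ndimage.label(mask, structure=np.ones((3, 3)))  # 8-connected labeling
    sizes = ndimage.sum(mask, labels, index=range(1, n + 1))    # each blob's "mass"
    # Step 4: keep only blobs above the area threshold, biggest first.
    order = np.argsort(sizes)[::-1]
    keep = [int(k) + 1 for k in order if sizes[k] >= min_area]
    return [labels == k for k in keep[:n_best]]

The returned masks are the n biggest blobs in the requested color range, ready for the per-blob color statistics mentioned above.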
2.4 Edge Detection

From an algorithmic point of view, edge detection translates into detecting sharp changes in image brightness; the underlying assumption is that such brightness changes are strongly correlated with important events and property changes in the world (represented by the image). In general, discontinuities in image brightness are likely to correspond to:
• discontinuities in depth;
• discontinuities in surface orientation;
• changes in material properties;
• variations in scene illumination.
In the ideal case, the result of applying an edge detector to an image is a set of connected curves that indicate the boundaries of objects, the boundaries of surface markings, as well as curves that correspond to discontinuities in surface orientation. Thus, applying an edge detector to an image may significantly reduce the amount of data to be processed and may therefore filter out information that may be regarded as less relevant, while preserving the important structural properties of an image. If the edge detection step is successful, the subsequent task of interpreting the information content in the original image may therefore be substantially simplified. Unfortunately, it is not always possible to obtain such ideal edges from real life images of moderate complexity. Edges extracted from nontrivial images are often hampered by fragmentation (the edge curves are not connected), missing edge segments, and false edges not corresponding to interesting phenomena in the image, thus complicating the subsequent task of interpreting the image data. There are many different algorithms for computing edges [32][29][11][35], but three must be cited:
• the Prewitt operator [31];
• the Sobel operator;
• the Canny filter [3].
The best of the three is the Canny filter; the other two are interesting because they are simple and fast (Figure 2.10). Mathematically, both the Sobel and Prewitt operators use two 3×3 kernels which are convolved with the original image to calculate approximations of the derivatives – one for horizontal changes, and one for vertical. Given the input image I, the output images Gx and Gy, which at each point contain the horizontal and vertical derivative approximations, are calculated as follows
G_x^S = S_v \otimes I, \qquad G_y^S = S_h \otimes I, \qquad G_x^P = P_v \otimes I, \qquad G_y^P = P_h \otimes I

where \otimes denotes the bidimensional convolution operator, and

S_v = \begin{pmatrix} 1 & 0 & -1 \\ 2 & 0 & -2 \\ 1 & 0 & -1 \end{pmatrix}, \quad
S_h = \begin{pmatrix} 1 & 2 & 1 \\ 0 & 0 & 0 \\ -1 & -2 & -1 \end{pmatrix}, \quad
P_v = \begin{pmatrix} -1 & 0 & 1 \\ -1 & 0 & 1 \\ -1 & 0 & 1 \end{pmatrix}, \quad
P_h = \begin{pmatrix} -1 & -1 & -1 \\ 0 & 0 & 0 \\ 1 & 1 & 1 \end{pmatrix}

denote the kernels for vertical and horizontal changes of the Sobel and Prewitt operators, respectively.
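As a minimal illustration, the Sobel derivative approximations and the gradient magnitude can be computed with a plain 2D convolution; scipy.ndimage.convolve is used here purely for convenience.

import numpy as np
from scipy import ndimage

Sv = np.array([[1, 0, -1],
               [2, 0, -2],
               [1, 0, -1]], dtype=float)  # kernel for horizontal changes
Sh = Sv.T                                 # kernel for vertical changes

def sobel_gradients(img):
    # Return the horizontal/vertical derivative approximations and the gradient magnitude.
    gx = ndimage.convolve(img, Sv, mode='nearest')
    gy = ndimage.convolve(img, Sh, mode='nearest')
    return gx, gy, np.hypot(gx, gy)

Replacing Sv and Sh with the Prewitt kernels Pv and Ph gives the Prewitt operator; everything else stays the same.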
Fig. 2.10 Original image (a), Sobel (b), Prewitt (c), and Canny (d) edge detectors.
Canny builds on top of the Sobel/Prewitt operators, considering the mathematical problem of deriving an optimal smoothing filter given the criteria of detection, localization and minimizing multiple responses to a single edge. He showed that the optimal filter given these assumptions is a sum of four exponential terms. He also showed that this filter can be well approximated by first-order derivatives of
Gaussians. Canny also introduced the notion of non-maximum suppression, which means that the image is scanned along the image gradient direction, and if pixels are not part of the local maxima they are set to zero. This has the effect of suppressing all image information that is not part of local maxima. Because the Canny edge detector uses a filter based on the first derivative of a Gaussian, it is susceptible to noise present in raw unprocessed image data, so the first step is to convolve the raw image with a Gaussian filter. The result is a slightly blurred version of the original which is not affected by a single noisy pixel to any significant degree. An edge in an image may point in a variety of directions, so the Canny algorithm uses four filters to detect horizontal, vertical and diagonal edges in the blurred image. The edge detection operator (Prewitt or Sobel, for example) returns a value for the first derivative in the horizontal direction (Gy) and the vertical direction (Gx). From these the edge gradient magnitude and direction can be determined:

G = \sqrt{G_x^2 + G_y^2}, \qquad \Theta = \arctan\left( \frac{G_y}{G_x} \right)
The edge direction angle is rounded to one of four angles representing vertical, horizontal and the two diagonals (0, 45, 90 and 135 degrees, for example). Given estimates of the image gradients, a search is then carried out to determine whether the gradient magnitude assumes a local maximum in the gradient direction (non-maximum suppression). For example:
• if the rounded angle is 0 degrees, the point is considered to be on the edge if its intensity is greater than the intensities in the north and south directions;
• if the rounded angle is 90 degrees, the point is considered to be on the edge if its intensity is greater than the intensities in the east and west directions;
• if the rounded angle is 135 degrees, the point is considered to be on the edge if its intensity is greater than the intensities in the north-east and south-west directions;
• if the rounded angle is 45 degrees, the point is considered to be on the edge if its intensity is greater than the intensities in the south-east and north-west directions.
This is worked out by passing a 3×3 grid over the intensity map. From this stage a set of edge points, in the form of a binary image, is obtained. Large intensity gradients are more likely to correspond to edges than small ones. In most cases it is impossible to specify a threshold at which a given intensity gradient switches from corresponding to an edge to not doing so; therefore Canny uses thresholding with hysteresis. Thresholding with hysteresis requires two thresholds – high and low. Making the assumption that important edges should lie along continuous curves in the image allows us to follow a faint section of a given line and to discard a few noisy pixels that do not constitute a line but have produced large gradients. Therefore we begin by applying the high threshold. This marks out the edges we can be fairly sure are genuine. Starting from these, using the directional information derived earlier,
edges can be traced through the image. While tracing an edge, we apply the lower threshold, allowing us to trace faint sections of edges as long as we find a starting point. Once this process is complete we have a binary image where each pixel is marked as either an edge pixel or a non-edge pixel.
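The last two stages can be sketched as follows, assuming the gradient magnitude and direction have already been computed; note that in this sketch the angle is taken as the gradient direction, so the comparisons are made along it (the complementary convention with respect to the edge-direction phrasing used above). The hysteresis step simply keeps every weak component that touches at least one strong pixel.

import numpy as np
from scipy import ndimage

def nonmax_suppression(mag, angle):
    # Keep a pixel only if it is a local maximum along the (rounded) gradient direction.
    h, w = mag.shape
    out = np.zeros_like(mag)
    ang = np.rad2deg(angle) % 180
    for i in range(1, h - 1):
        for j in range(1, w - 1):
            a = ang[i, j]
            if a < 22.5 or a >= 157.5:            # ~0 deg: compare east/west
                n1, n2 = mag[i, j - 1], mag[i, j + 1]
            elif a < 67.5:                        # ~45 deg: one diagonal
                n1, n2 = mag[i - 1, j + 1], mag[i + 1, j - 1]
            elif a < 112.5:                       # ~90 deg: compare north/south
                n1, n2 = mag[i - 1, j], mag[i + 1, j]
            else:                                 # ~135 deg: the other diagonal
                n1, n2 = mag[i - 1, j - 1], mag[i + 1, j + 1]
            if mag[i, j] >= n1 and mag[i, j] >= n2:
                out[i, j] = mag[i, j]
    return out

def hysteresis(mag, low, high):
    # Keep weak edge pixels (>= low) only if their component contains a strong pixel (>= high).
    strong = mag >= high
    weak = mag >= low
    labels, n = ndimage.label(weak, structure=np.ones((3, 3)))  # 8-connected components
    keep = np.unique(labels[strong])
    return np.isin(labels, keep[keep > 0])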
2.5 Simple Shapes

Edge detectors are a fundamental step in shape analysis, and so are algorithms for shape comparison and matching. Shape matching usually requires the evaluation of a distance between the shapes themselves or between their projections in some other feature space; the n-dimensional (or vector) Euclidean distance is a good candidate.
2.5.1 Scale and Position Invariants: Procrustes Analysis

In order to compare two shapes we need to make them independent of position and size. Procrustes analysis is a form of statistical shape analysis used to analyse the distribution of a set of shapes. The name Procrustes refers to a bandit from Greek mythology who made his victims fit his bed either by stretching their limbs or cutting them off. Here we just consider objects made up of a finite number k of points in n dimensions; these points are called landmark points. The shape of an object can be considered as a member of an equivalence class formed by removing the translational, rotational and scaling components. For example, the translational component can be removed from an object by translating the object so that the mean of all its points lies at the origin. Likewise, the scale component can be removed by scaling the object so that the sum of the squared distances from the points to the origin is 1. Mathematically: take k points in two dimensions,

((x_1, y_1), (x_2, y_2), \ldots, (x_k, y_k)).

The mean of these points is (\bar{x}, \bar{y}), where

\bar{x} = \frac{1}{k} \sum_{i=1}^{k} x_i, \qquad \bar{y} = \frac{1}{k} \sum_{i=1}^{k} y_i.

Now translate these points so that the mean is moved to the origin, (x, y) \to (x - \bar{x}, y - \bar{y}), giving the points (x_1 - \bar{x}, y_1 - \bar{y}), \ldots. Likewise, scale can be removed by finding the size of the object,

s = \sqrt{(x_1 - \bar{x})^2 + (y_1 - \bar{y})^2 + \cdots},

and dividing the points by the scale, giving points ((x_1 - \bar{x})/s, (y_1 - \bar{y})/s). Other methods for removing the scale can also be used.
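A two-line sketch of the normalization just described, for a k x 2 array of landmark points:

import numpy as np

def normalize_shape(points):
    # Remove translation (centroid to origin) and scale (sum of squared distances = 1).
    centered = points - points.mean(axis=0)
    return centered / np.sqrt((centered ** 2).sum())

def shape_distance(a, b):
    # Euclidean distance between two normalized shapes with corresponding landmarks.
    return np.linalg.norm(normalize_shape(a) - normalize_shape(b))

Two shapes normalized this way still differ by a rotation (and by the labeling of their landmark points), so a rotation-invariant comparison requires an alignment step such as the one described next.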
2.5.2 Shape Alignment: Iterative Closest Point

Iterative Closest Point (ICP) was introduced by Besl and McKay in 1992 [1] and solves the general problem of matching two clouds of points. This matching technique can be used for anything from simple 2D shape alignment to complex 3D surface reconstruction. The algorithm is very simple and is commonly used in real time. The goal of ICP is to find the rigid transformation T that best aligns a cloud of scene points S with a geometric model M. The alignment process works to minimize the mean squared distance between scene points and their closest model point. ICP is efficient, with average case complexity of O(n log n) for n point images, and it converges monotonically to a local minimum. At each iteration, the algorithm computes correspondences by finding closest points and then minimizes the mean square error in position between the correspondences [6][16]. A good initial estimate of the transformation is required and all scene points are assumed to have correspondences in the model (if the model shape can be parameterized, this limitation can be overcome by generating a model with a number of points equal to that of the scene).

Algorithm 1: Iterative Closest Point
Initial situation: let S be a set of N_s points \{s_1, \ldots, s_{N_s}\}, and let M be the model. Let \|s - m\| be the distance between points s \in S and m \in M, and let CP(s_i, M) be the closest point in M to the scene point s_i.
Phase 1: let T_0 be an initial estimate of the transformation.
Phase 2: repeat for k = 1, \ldots, k_{max} or until a termination criterion is met:
1. build the set of correspondences

C = \bigcup_{i=1}^{N_s} \{ (T_{k-1}(s_i),\ CP(T_{k-1}(s_i), M)) \};

2. compute the new transformation T_k that minimizes the mean square error between the point pairs in C [6][16].
The result is the refined transformation T_{k_{max}} (translation, rotation). For rigid deformation, the distance used in ICP is simply the Euclidean distance, and a point is a 3D point with components (x, y, z) [1][36].
For non-rigid deformation, it is no longer correct to say that corresponding points have the closest Euclidean distance; the distance should be redefined to describe the similarity between corresponding points. Feldmar [7] extends the 3D Euclidean point to an 8D point by adding the normal (n_x, n_y, n_z) and the principal curvatures (k_1, k_2) to the (x, y, z) components. Given two surfaces S_1 and S_2, his definition is

d(M, N) = \big( \alpha_1 (x - x')^2 + \alpha_2 (y - y')^2 + \alpha_3 (z - z')^2 + \alpha_4 (n_x - n_x')^2 + \alpha_5 (n_y - n_y')^2 + \alpha_6 (n_z - n_z')^2 + \alpha_7 (k_1 - k_1')^2 + \alpha_8 (k_2 - k_2')^2 \big)^{1/2}

where M is a point on surface S_1, N is a point on surface S_2, (n_x, n_y, n_z) is the normal on S_1 at point M, k_1, k_2 are the principal curvatures, and \alpha_i is the difference between the maximal and minimal value of the i-th coordinate of the points in S_2. His formulation supports both global and local affine registration.
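A minimal rigid 2D version of the ICP loop can be sketched as follows; it uses a k-d tree for the closest-point step and the standard SVD-based least-squares solution for the rigid transform, and it assumes a reasonable initial alignment, as required by the algorithm.

import numpy as np
from scipy.spatial import cKDTree

def best_rigid_transform(src, dst):
    # Least-squares rotation R and translation t mapping src onto dst (both N x 2).
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    H = (src - mu_s).T @ (dst - mu_d)
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:       # guard against reflections
        Vt[-1] *= -1
        R = Vt.T @ U.T
    return R, mu_d - R @ mu_s

def icp(scene, model, iters=30):
    # Iteratively align the scene points S to the model points M.
    tree = cKDTree(model)
    cur = scene.copy()
    for _ in range(iters):
        _, idx = tree.query(cur)                      # CP(s_i, M): closest-point correspondences
        R, t = best_rigid_transform(cur, model[idx])  # minimize the mean square error
        cur = cur @ R.T + t
    return cur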
2.5.3 Shape Encoding and Matching: Curvature Scale Space

The curvature scale space (CSS) was introduced by Mokhtarian and Mackworth [23][24] as a shape representation for planar curves. The representation is computed by convolving a path-based representation of the curve with a Gaussian function, as the standard deviation of the Gaussian varies from a small to a large value, and extracting the curvature zero-crossing points of the resulting curves. The representation is essentially invariant under rotation, uniform scaling, and translation of the curve. This and a number of other properties make it suitable for recognizing a noisy curve of arbitrary shape at any scale or orientation. After substantial and comprehensive testing, the CSS technique was selected as a contour shape descriptor for MPEG-7 [21]. To create a CSS description of a contour shape, N equi-distant points are selected on the contour, starting from an arbitrary point and following the contour clockwise. The x and y coordinates of the selected N points are grouped into two series X and Y. The contour is then gradually smoothed by repetitive application of a Gaussian kernel (the MPEG-7 standard uses a low-pass filter with the kernel (0.25, 0.5, 0.25)) to the X and Y coordinates of the selected contour points. As a result of the smoothing, the contour evolves and its concave parts gradually flatten out, until the contour becomes convex. A so-called CSS image can be associated with the contour evolution process (the CSS image does not have to be explicitly extracted, but it is useful to illustrate the CSS representation). The CSS image horizontal coordinates correspond to the indices of the contour points selected to represent the contour (1, \ldots, N), and the CSS image vertical coordinates correspond to the amount of filtering applied, defined as the number of passes of the filter. Each horizontal line in the CSS image corresponds to the smoothed contour resulting from k passes of the filter (Figure 2.11). For each smoothed contour,
the zero-crossings of its curvature function are computed. Curvature zero-crossing points separate concave and convex parts of the contour. The CSS image has characteristic peaks. The coordinate values of the prominent peaks (xcss , ycss ) in the CSS image are extracted; in addition, the eccentricity and circularity of the contour can also be calculated.
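A sketch of the contour evolution used to build the CSS representation: the N sampled contour coordinates are repeatedly smoothed with the (0.25, 0.5, 0.25) kernel and the curvature zero-crossings are recorded after each pass. The helper functions are hypothetical and only illustrate the idea, not the MPEG-7 reference implementation.

import numpy as np

KERNEL = np.array([0.25, 0.5, 0.25])

def smooth_closed(v):
    # One smoothing pass of a closed (circular) coordinate sequence.
    ext = np.concatenate(([v[-1]], v, [v[0]]))        # wrap around the contour
    return np.convolve(ext, KERNEL, mode='valid')

def curvature_zero_crossings(x, y):
    # Indices where the curvature of the contour (x, y) changes sign.
    dx, dy = np.gradient(x), np.gradient(y)
    ddx, ddy = np.gradient(dx), np.gradient(dy)
    k = dx * ddy - dy * ddx                           # numerator of the curvature
    return np.where(np.sign(k) != np.sign(np.roll(k, 1)))[0]

def css_points(x, y, passes=60):
    # Collect (contour index, pass number) pairs: the raw material of the CSS image.
    points = []
    for p in range(1, passes + 1):
        x, y = smooth_closed(x), smooth_closed(y)
        for i in curvature_zero_crossings(x, y):
            points.append((i, p))
    return points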
Fig. 2.11 Original image (a), outer contour (b), after 15 filtering steps (c), after 30 filtering steps (d), after 45 filtering steps (e), after 60 filtering steps (f).
Once the CSS descriptors have been extracted, the shape is ready for matching.
The canonical CSS (CCSS) based shape retrieval algorithm [5] compares the CSS descriptor of a query image with a set of CSS descriptors of database images, and returns the n best matches as its output. In order to find the minimum cost of the match between a query image and a database image, the algorithm must consider all possible ways of aligning the contour maxima from both CSS images, and compute the associated cost by shifting either the query CSS image or a database CSS image. Unfortunately, in general the computation of a CSS image can take a long time, making it difficult to apply the method to real-time object recognition. To overcome this limitation, several variations and hybridizations of the original algorithm have been proposed [19][28][37].
2.6 Combination of simple features

We will now use some of the features presented in the previous sections to build a sky detector; many good sky detectors exist, but our aim is to show that, with the combination of a few simple generic features, interesting results can be obtained. This is just an example, and from the point of view of pure performance it cannot be compared to dedicated algorithms [20][8]. The idea behind our sky detector is the following:
1. extract blue blobs B from the input image I;
2. use a simple texture analysis to discriminate between sky blobs and non-sky blobs;
3. extract edges E from I using the Canny edge detector;
4. perform a binary AND between B and E to generate the combination mask C;
5. perform a morphological closing over C;
6. run the texture analysis again to discriminate between sky blobs and non-sky blobs.
The first step is to define the meaning of "sky" from the point of view of the color "blue": after a manual sampling over some twenty images containing sky, we define "blue" to be the color in the following HSV range:

140 ≤ x ≤ 300, x ∈ H, H = {h ∈ R, 0 ≤ h ≤ 360}
0 ≤ y ≤ 0.45, y ∈ S, S = {s ∈ R, 0 ≤ s ≤ 1}
76 ≤ z ≤ 255, z ∈ V, V = {v ∈ N, 0 ≤ v ≤ 255}
The result of blob extraction using this definition can be seen in Figure 2.13 (b). There are many non-sky blobs, so we use a simple observation made by Luo [20] to discriminate good blobs from bad ones: as a result of the physics of light scattering by small particles in the air, clear sky often appears as a deep, saturated blue at the top of the image and gradually desaturates to almost white towards a distant horizon line. What we need is then a gradient detector to measure this desaturation effect. The simplest way we can think of is to measure the difference in saturation between the top and the bottom pixels of each blob; if the difference is bigger than a threshold, then we mark the blob as sky, otherwise as non-sky.
Many improvements could be made, but even in its naive simplicity this approach works well enough for our purpose (Figure 2.13 (c)). The effect of texture analysis is to delete many but not all non-sky blobs, so the next step is to take advantage of edge detection to better partition the sky blobs. It is evident from Figure 2.13 (c) that the big sky blob is the sum of a little "true" sky blob at the top plus a portion of a hill at the bottom. From Figure 2.12 (a) we can see that the edge detector detects the border line between the sky and the hill, so we superimpose the edges on the blob and check whether there are edges that partition it (Figure 2.12 (d)). To make those "fractures" more evident we perform a morphological closing (Figure 2.12 (e)). The new blobs created are visible in Figure 2.13 (d). The last step is to re-run the texture analysis and discriminate again good from bad blobs (Figure 2.13 (e), (f)). The result is that only the true sky blob is marked as good. In many cases this simple algorithm works well; however, it depends heavily on the quality of the edges found, and tends to over-fragment the blobs. The same ideas can be used to find vegetation, skin, and so on by changing the color definition and the texture discrimination function; constraints on blob shape can be added using the algorithms presented in Section 2.5.
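The color test and the desaturation check at the heart of the sky detector can be sketched as follows; the HSV bounds are the ones quoted above, the saturation threshold is an arbitrary illustrative value, and the edge-combination and morphological steps would reuse the routines shown in the previous sections.

import numpy as np

def is_blue(h, s, v):
    # Color test from the manually sampled HSV range above.
    return (140 <= h <= 300) and (0 <= s <= 0.45) and (76 <= v <= 255)

def desaturates_downward(sat_img, blob_mask, threshold=0.1):
    # Mark a blob as sky if saturation drops enough from its top rows to its bottom rows.
    rows = np.where(blob_mask.any(axis=1))[0]
    top = sat_img[rows[0]][blob_mask[rows[0]]].mean()       # mean saturation of the top row
    bottom = sat_img[rows[-1]][blob_mask[rows[-1]]].mean()  # mean saturation of the bottom row
    return (top - bottom) > threshold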
2.7 Conclusions

The goal of this chapter was to introduce some basic ideas on image feature extraction, particularly from the point of view of color, edges, and shapes. The algorithms presented are well known and widely used, and even though some of them were conceived many years ago, they can still be considered state of the art. In the last section we used an example to introduce some simple yet useful hybridization ideas, showing how to combine different techniques to extract the sky blobs from an image. This is the first step in image understanding: knowing whether there is sky or not can lead to considerations on the environment (indoor/outdoor) and the weather (color of the sky, fragmentation of the blobs due to clouds). Even if we are not interested in the presence of sky, this first step can be used to delete useless blobs from the image, and therefore reduce the search space. The combination of two or more such simple detectors can generate non-trivial context information, which can in turn be used to make higher-order logical reasoning; this higher-order information will then become the new features, ready to be combined again and generate deeper image comprehension.
Fig. 2.12 Superimposition of original image and Canny edge detector (a), original blob (b), edges within the blob (c), blob partition according to edges (d), partition after morphological closing (e).
References

1. Besl, P.J., McKay, N.D.: A method for registration of 3-D shapes. IEEE Trans. on Pattern Analysis and Machine Intelligence, 14, 239–256 (1992)
Fig. 2.13 Original image (a), after color blob extraction (b), after texture filtering (c), after edge combination (d), after second texture filtering (e), final sky blob (f).
2. Bover, M.: MPEG-7 Visual Shape Descriptors. IEEE Trans. on Circuits and Systems for Video Technology, 11, 716–719 (2001)
3. Canny, J.: A Computational Approach To Edge Detection. IEEE Trans. on Pattern Analysis and Machine Intelligence, 8, 679–714 (1986)
4. Chen, Y., Sundaram, H.: Estimating Complexity of 2D Shapes. IEEE 7th Workshop on Multimedia Signal Processing, pp. 1–4, Shanghai (2005)
5. Cheng, H.D., Li, J.: Fuzzy homogeneity and scale-space approach to color image segmentation. Pattern Recognition, 36, 1545–1562 (2003)
6. Faugeras, O.D., Hebert, M.: The Representation, Recognition, and Locating of 3-D Objects. Int. J. Robotics Research, 5(3), 27–52 (1986)
7. Feldmar, J., Ayache, N.J.: Rigid, affine and locally affine registration of free-form surfaces. Int. J. on Computer Vision, 18, 99–119 (1996)
8. Gallagher, A.C., Luo, J., Hao, W.: Improved Blue Sky Detection Using Polynomial Model Fit. ICIP, 4, 2367–2370 (2004)
9. DARPA: http://www.darpa.mil/GRANDCHALLENGE/
10. Haralick, R.M., Shapiro, L.G.: Computer and Robot Vision. Vol. 1, Addison-Wesley, pp. 28–48 (1992)
11. Harris, C., Stephens, M.: A combined corner and edge detector. Proc. of the 4th Alvey Vision Conference, pp. 147–151 (1988)
12. Hecht, E.: Optics, 2nd Edition. Addison Wesley (1987)
13. Hirsch, R.: Exploring Colour Photography: A Complete Guide. Laurence King Publishing (2004)
14. Honda: http://world.honda.com/ASIMO/
15. Horn, B.K.P.: Robot Vision, pp. 69–71, MIT Press (1986)
16. Horn, B.K.P.: Closed Form Solutions of Absolute Orientation Using Unit Quaternions. Journal of the Optical Society of America, 4(4), 629–642 (1987)
17. Hunter, R.S.: Photoelectric Color-Difference Meter. Proceedings of the Winter Meeting of the Optical Society of America (1948)
18. Hunter, R.S.: Accuracy, Precision, and Stability of New Photo-electric Color-Difference Meter. Proc. of the Thirty-Third Annual Meeting of the Optical Society of America (1948)
19. Kopf, S., Haenselmann, T., Effelsberg, W.: Enhancing Curvature Scale Space Features for Robust Shape Classification. Proc. of International Conference on Multimedia & Expo (2005)
20. Luo, J., Etz, S.P.: A physical model-based approach to detecting sky in photographic images. IEEE Trans. on Image Processing, 11, 201–212 (2002)
21. MPEG-7 Standard: http://www.chiariglione.org/MPEG/standards/mpeg-7/mpeg-7.htm
22. Matas, J., Chum, O., Urban, M., Pajdla, T.: Robust wide baseline stereo from maximally stable extremum regions. Proc. of British Machine Vision Conference, pp. 384–393 (2002)
23. Mokhtarian, F., Mackworth, A.K.: Scale-based description and recognition of planar curves and two-dimensional shapes. IEEE Trans. on Pattern Analysis and Machine Intelligence, 8, 34–43 (1986)
24. Mokhtarian, F., Mackworth, A.K.: A theory of multi-scale, curvature-based shape representation for planar curves. IEEE Trans. on Pattern Analysis and Machine Intelligence, 14, 789–805 (1992)
25. Munsell, A.H.: A Color Notation. G. H. Ellis Co., Boston (1905)
26. Munsell, A.H.: A Pigment Color System and Notation. The American Journal of Psychology, 23, 236–244 (1912)
27. Yerkes, R.M.: Introduction to Psychology. H. Holt (1911)
28. Peng, J., Yang, W., Cao, Z.: A Symbolic Representation for Shape Retrieval in Curvature Scale Space. Proc. of IEEE Int. Conf. on Computational Intelligence for Modelling Control and Automation and Int. Conf. on Intelligent Agents, Web Technologies and Internet Commerce (2006)
29. Rosten, E., Drummond, T.: Machine learning for high-speed corner detection. Proc. of European Conference on Computer Vision, pp. 430–443 (2006)
30. Smith, A.R.: Color Gamut Transform Pairs. Computer Graphics, 12(3), August (1978)
31. Sobel, I., Feldman, G.: A 3x3 Isotropic Gradient Operator for Image Processing. Presented at a talk at the Stanford Artificial Project in 1968, unpublished but often cited; orig. in Pattern Classification and Scene Analysis, R. Duda and P. Hart, John Wiley and Sons, pp. 271–272 (1973)
32. Smith, S.M., Brady, J.M.: SUSAN – a new approach to low level image processing. Int. J. of Computer Vision, 23, 45–78 (1997)
33. Techblog: http://www.techeblog.com/index.php/tech-gadget/samsungs-200000-machine-gun-sentry-robot
34. Tomasi, C., Manduchi, R.: Bilateral Filtering for Gray and Color Images. Proc. IEEE Int. Conf. on Computer Vision (1998)
35. Trajkovic, M., Hedley, M.: Fast corner detection. Image and Vision Computing, 16, 75–87 (1998)
36. Zhang, Z.Y.: Iterative point matching for registration of free-form curves and surfaces. Int. J. on Computer Vision, 13, 119–152 (1994)
37. Zhong, B., Liao, W.: A Hybrid Method for Fast Computing the Curvature Scale Space Image. Proc. of Geometric Modeling and Processing (2004)
Chapter 3
Fast and robust Face Detection

Marco Anisetti
Summary. This chapter presents a fully automatic face detection system robust to moderate changes in expression, posture and illumination. The final goal of the detection is to initialize a 3D face tracker; it is therefore specialized for working on videos of good quality rather than on still images. More in detail, we present two different face detection strategies based on a slightly modified version of the widely used Viola-Jones [1] object detector.
Key words: Face Detection, Feature Extraction, Skin Map
3.1 Related work

Face detection is one of the main parts of the 3D model initialization procedure. In general, it enables a further feature-oriented search process that yields the final model initialization. Unfortunately, the human face is a dynamic object and has a high degree of variability in its appearance, which makes face detection a difficult problem. The challenges associated with face detection can be attributed to the following factors:
1. Posture of the face. The images of a face vary due to the relative camera-face pose (frontal, 45 degrees, profile, upside-down), and some facial features such as an eye or the nose may become partially or wholly occluded.
2. Presence or absence of facial features such as beards, mustaches, and glasses, which increase the variability of facial appearance together with shape, color, and size.
3. Facial expression.
Marco Anisetti
University of Milan, Department of Information Technology, Via Bramante 65, 26013 Crema, Italy, e-mail: [email protected]
4. Facial occlusion by other objects.
5. Image lighting conditions or camera characteristics (sensor response, lenses) that affect the appearance of the face.
A wide variety of techniques have been proposed, ranging from simple edge-based algorithms to composite high-level approaches utilizing advanced pattern recognition methods. Because face detection techniques require a priori information about the face, they can be effectively organized into two broad categories distinguished by their different approach to utilizing face knowledge:
• Feature-based. These techniques make explicit use of face knowledge and follow the classical detection methodology in which low-level features are derived prior to knowledge-based analysis. The apparent properties of the face, such as skin color and face geometry, are exploited at different system levels.
• Appearance-based. These techniques address face detection as a general recognition problem. Image representations of faces are directly classified into a face group using training algorithms, without feature derivation and analysis. Unlike the feature-based approach, these approaches incorporate face knowledge implicitly into the system through mapping and training schemes.
3.1.1 Feature-based

The underlying assumption is based on the observation that humans can effortlessly detect faces and objects in different poses and lighting conditions, so there must exist properties or features which are invariant over these variabilities. Numerous methods have been proposed to first detect facial features and then infer the presence of a face. In general, the feature-based approach can be further divided into three areas: low-level analysis, feature analysis, and template-based methods. The low-level analysis includes a widely used family of color-based techniques called the skin-map approach; given its importance in the literature, it is treated separately below.
3.1.1.1 Low level

The analysis first deals with the segmentation of visual features using pixel properties such as gray-scale and color. Because of their low-level nature, features generated from this analysis are ambiguous. Many low-level approaches are used as support for more complex systems, generally exploiting the researchers' knowledge of human faces (i.e., rule-based systems). In general, the low-level analysis involves edges, gray-scale values (thresholding), and color (skin-map regions). Other detection algorithms tend to use a mixture of low-level techniques [2]. Another interesting work [3] considers visual features such as edges and color as derived in the early stage of the human visual system, shown by the various visual response patterns in our
inner retina. This pre-attentive processing allows visual information to be organized in various bases prior to high-level visual activities in the brain; therefore, machine vision systems should begin with pre-attentive low-level computation of generalized image properties. This work introduces a generalized symmetry operator that is based on edge pixel operations. The symmetry measure assigns a magnitude at every pixel location in an image; this magnitude map clearly shows the locations of facial features. The edge is one of the most used low-level features in computer vision applications. Edge detection is the foremost step in deriving an edge representation. So far, many different types of edge operators have been applied. The Sobel operator was the most common filter among the edge-based techniques [4]. The Marr-Hildreth edge operator is part of the systems proposed in [5]. A variety of first and second derivatives (Laplacians) of Gaussians have also been used in other methods. For instance, a Laplacian of large scale was used to obtain steerable filters in [6]. In an edge-detection-based approach to face detection, edges need to be labeled and matched to a face model in order to verify correct detections. Some other works obtain edge maps using wavelets [7]. Besides edge details, the gray information within a face can also be used as a feature. The main idea is that facial features such as eyebrows, pupils, and lips generally appear darker than their surrounding facial regions. In these algorithms, the input images are first enhanced by contrast stretching and gray-scale morphological routines to improve the quality of local dark patches and thereby make detection easier [8]. The extraction of dark patches is achieved by low-level gray-scale thresholding. Yang and Huang [9], on the other hand, explore the gray-scale behavior of faces by creating a multiresolution hierarchy of images through averaging and sub-sampling. The low-level analysis is based on the fact that at low resolution the face region becomes uniform. Starting from low resolution images, face candidates are established by a set of rules that search for uniform regions. The face candidates are then verified by the existence of prominent facial features. One attractive feature of this method is that a coarse-to-fine or focus-of-attention strategy is used to reduce the required computation. The ideas of using a multiresolution hierarchy and rules to guide searches have influenced many later face detection works [10]. Whilst gray information provides the basic representation for image features, color is a more powerful means of discerning object appearance. It was found that different human skin colors give rise to a tight cluster in color spaces even when faces of different races are considered. This idea is at the basis of the skin-map approach.
3.1.1.2 Skin-map

The skin-map approach is based on the idea that a face region can be detected in a color image by considering its particular color distribution. Although this assumption can easily be disproven when different skin tones are considered, and skin maps are therefore generally confined to a particular population by construction, the approach has received great attention from researchers. The skin-map approach builds on an extensive literature of related works involving color spaces and color distribution models. A recent work [11] gives
an overall analysis of the main issues related to these aspects, in particular color representation, quantization and classification. Skin-map approaches can be mainly classified into two different families:
• Region-based methods. These methods try to take the spatial arrangement of skin pixels into account during the detection stage to enhance performance [12][13].
• Pixel-based methods. These skin detection methods classify each pixel as skin or non-skin individually, independently from its neighbors.
Although different people have different skin colors, several studies have shown that the major difference lies largely in intensity rather than in chrominance [8]. Several color spaces have been utilized to label pixels as skin, including:
• RGB: one of the most widely used colorspaces for processing and storing digital image data. However, the high correlation between channels, significant perceptual non-uniformity, and mixing of chrominance and luminance data make RGB not a very favorable choice for color analysis and color-based recognition algorithms. Nevertheless, this color space is used in the literature [14].
• Normalized RGB: a normalized representation obtained from the RGB values using r = R/(R + G + B), g = G/(R + G + B), b = B/(R + G + B). The third component does not hold any significant information (r + g + b = 1) and can be omitted, reducing the space dimensionality. In this manner the dependence of r and g on the brightness of the source RGB color is diminished; the normalized color components (r and g) are therefore called "pure colors". This color space presents a certain invariance with respect to the light source (excluding ambient light) [15].
• Hue Saturation Intensity (Value, Lightness) – HSI, HSV, HSL: these color spaces describe color with intuitive values of tint, saturation and tone. Hue defines the dominant color (such as red, green, purple or yellow) of an area, saturation measures the colorfulness of an area in proportion to its brightness, and the "intensity", "lightness" or "value" is related to the color luminance. The intuitiveness of these colorspaces, with their explicit discrimination between luminance and chrominance properties, has made them popular. However, they have several undesirable features, including hue discontinuities and the computation of "brightness" (lightness, value), which conflicts badly with the properties of color vision.
• YCrCb: in this color space the color is constructed as a weighted sum of the RGB values, plus two color difference values Cr and Cb that are formed by subtracting luma from the RGB red and blue components: Y = 0.299R + 0.587G + 0.114B, Cr = R − Y, Cb = B − Y. This transformation provides an explicit separation of luminance and chrominance components, which makes this colorspace attractive for skin color modeling [16].
• CIE LAB and CIE LUV: these are reasonably perceptually uniform colorspaces standardized by the CIE (Commission Internationale de l'Eclairage). Perceptual uniformity means that a small perturbation to a component value is approximately equally perceptible across the range of that value. The well known
RGB colorspace is far from being perceptually uniform. The price for better perceptual uniformity is complex transformation functions from and to RGB space. Several other linear transforms of the RGB space have been employed for skin detection – YES [17], YUV [18] and YIQ [19][20] – as well as less frequently used colorspaces such as CIE-xyz [21]. The final goal of skin color detection is to build a decision rule that discriminates between skin and non-skin pixels, and many methods have been proposed to build such a skin color model. The simplest model defines a region of skin tone pixels using Cr, Cb values from samples of skin color pixels: with carefully chosen thresholds Cr1, Cr2 and Cb1, Cb2 as ranges, a pixel is classified as having skin tone if its values fall within those ranges [22]. Other, more complex models use a histogram-based approach. The colorspace is quantized into a number of bins, each corresponding to a particular range of color components; these bins form a histogram (3D or 2D depending on the color channels used). Each bin stores the number of times that particular color occurred in the training skin images. After training, the histogram counts are normalized, converting histogram values into a discrete probability distribution: P_skin(c) = skin(c)/Norm, where Norm is a normalization factor and skin(c) gives the value of the histogram bin corresponding to the color vector c. The normalized values of the lookup table bins constitute the likelihood that the corresponding colors correspond to skin [23]. Starting from the histogram definition, other approaches use a formal probability definition. In fact, the value of P_skin(c) computed above is a conditional probability of observing color c knowing that we see a skin pixel, P(c|skin). Therefore, using a Bayesian approach it becomes possible to compute the probability of observing skin given a concrete color value c, P(skin|c). The probabilities needed for Bayes' rule, P(c|skin) and P(c|¬skin), are directly computed from the skin and non-skin color histograms. The prior probabilities P(skin) and P(¬skin) can also be estimated from the overall number of skin and non-skin samples in the training set [23]. An inequality P(skin|c) ≥ θ, where θ is a threshold value, can be used as a skin detection rule. Histogram-based skin models are training dependent (they rely on the representativeness of the training image set) and require much storage space. Another type of skin model, the "parametric skin distribution models", provides a more compact skin model representation and the ability to generalize and interpolate the training data. Several models belong to this category: i) single Gaussian, ii) mixture of Gaussians, iii) multiple Gaussians, and iv) the elliptic boundary model. In the single Gaussian approach the skin color distribution is modeled by an elliptical Gaussian joint probability density function (pdf). The Gaussian mixture model is a more sophisticated model, capable of describing complex-shaped distributions, based on a generalization of the single Gaussian; the pdf in this case is p(c|skin) = \sum_{i=1}^{k} \pi_i p_i(c|skin), where k is the number of mixture components, \pi_i are the mixing parameters, obeying the normalization constraint \sum_{i=1}^{k} \pi_i = 1, and p_i(c|skin) are Gaussian pdfs, each with its own mean and covariance matrix. Skin classification is done by comparing the p(c|skin) value to some threshold.
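As an illustration of the histogram/Bayes rule just described, assume two pre-built 2D chrominance histograms counted over labelled skin and non-skin training pixels; the bin count and the threshold θ are free choices, and the helper names are ours.

import numpy as np

def skin_posterior(skin_hist, nonskin_hist, n_skin, n_nonskin):
    # Build a lookup table with P(skin | c) for every chrominance bin c, via Bayes' rule.
    p_c_skin = skin_hist / skin_hist.sum()            # P(c | skin)
    p_c_nonskin = nonskin_hist / nonskin_hist.sum()   # P(c | not skin)
    p_skin = n_skin / (n_skin + n_nonskin)            # prior estimated from the training set
    num = p_c_skin * p_skin
    den = num + p_c_nonskin * (1 - p_skin)
    return np.where(den > 0, num / den, 0.0)

def classify_skin(cr_cb, lut, bins=32, theta=0.4):
    # Label pixels as skin when P(skin | c) >= theta; Cr/Cb values assumed in [0, 255].
    idx = (cr_cb * bins // 256).astype(int)
    return lut[idx[..., 0], idx[..., 1]] >= theta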
The multiple-Gaussian technique approximates the skin color cluster with three 3D Gaussians; training is performed using a variant of the k-means clustering algorithm, and the Mahalanobis distance is used for skin classification, evaluating the distance from the closest model cluster center [24]. The elliptic boundary model starts from the observation that the skin distribution, although approximately elliptic in shape, is not well approximated by a single Gaussian. The elliptical boundary model [23] is as fast and simple to train and evaluate as the single Gaussian model, but seems to be more efficient compared both to the single and to the mixture of Gaussians. In parametric methods, the goodness of fit depends more on the distribution shape, and therefore on the colorspace used, than for non-parametric skin models. Finally, a family of skin modeling methods called "dynamic skin models" was designed and tuned specifically for skin detection during face tracking. In this case the skin model can be tuned for one concrete situation: person, camera and lighting; therefore the model can be more specific. The basic idea is to obtain a skin classification model that is optimal for the given conditions (using some initialization procedure). Since there is no need for model generality, it is possible to reach higher skin detection rates with low false positives with this specific model than with general skin color models intended to classify skin in totally unconstrained image sets [23]. On the other hand, skin color distribution can vary with time, along with lighting or camera white balance changes, so the model should be able to update itself to match the changing conditions.
3.1.1.3 Feature analysis

Features generated from low-level analysis are likely to be ambiguous. In this type of approach the visual features are organized into a more global concept of face and facial features using information about face geometry. Through feature analysis, feature ambiguities are reduced and the locations of the face and facial features are determined. Some interesting techniques [25][26] are based on a feature searching process that starts with the determination of prominent facial features. The detection of the prominent features then allows the existence of other, less prominent features to be hypothesized using anthropometric measurements of face geometry. These algorithms generally rely extensively on heuristic information taken from various face images modeled under fixed conditions, and in complex situations they will fail because of their rigid nature. Some face detection research addresses this problem by grouping facial features in face-like constellations using more robust modeling methods [27][28][29]. Another interesting approach [30] applies a gradient-type operator over local windows and converts the input image into a directional image. From this directional image a two-stage face detection method is applied, consisting of a generalized Hough transform and a set of 12 binary templates representing face constellations.
3.1.1.4 Template based

Several standard patterns of a face are stored to describe the face as a whole or the facial features separately. In template matching, a standard face pattern (usually frontal) is manually predefined or parameterized by a function. The correlations between an input image and the stored patterns are computed for detection, and the existence of a face is determined based on the correlation values. Some template techniques take advantage of simpler, static models. In many cases this static model is related to the frontal-view face (i.e., the outline shape of a face). In general these methods take advantage of edge maps, so in many cases the initial analysis can be classified in the low-level family. In [4] an edge-grouping approach is used with several constraints for face template search; the same process is repeated at different scales to locate features such as eyes, eyebrows, and lips. In [31] the facial feature searching process is based more on a control strategy to guide and assess the results from the template-based feature detectors. In a similar way, [5] uses a linked, neighborhood-oriented process to define the facial template. Other works include the concept of edges in a silhouette template [32]: a set of basis face silhouettes is obtained using principal component analysis (PCA) on face examples, and these eigen-silhouettes are then used with a generalized Hough transform for localization. More recently, an interesting work uses a set of silhouette-oriented features (edgelet features) for detector learning applied to full-body detection [33], but easily adaptable to face detection. To overcome the frontal limitation of the previous works, a hierarchical template matching method for face detection was proposed in [34]. A multiresolution image hierarchy is formed and edges are extracted using the Laplacian operator. The face template consists of the edges produced by six facial components: two eyebrows, two eyes, one nose, and one mouth. Finally, heuristics are applied to determine the existence of a face. Other interesting techniques use deformable templates. In this approach, facial features are described by parameterized templates, but the deformable template must be initialized in the proximity of the object of interest. In [35], the detection method is based on snakes obtained over a blurred image with edges enhanced by a morphological operator. Each face is approximated by an ellipse, and a Hough transform of the snakelets is used to find a dominant ellipse. Some template approaches describe a face representation method with both shape and intensity information [36][37]. They are borderline with appearance-based methods; we refer to them as "appearance template" models. In fact, they construct a template model by training and use a Point Distribution Model (PDM) to characterize the shape vectors over an ensemble of individuals, together with an approach to represent the shape-normalized intensity appearance. A face-shape PDM can be used to locate faces in new images by using an Active Shape Model (ASM) [38] search to estimate the face location and shape parameters. The face patch is then deformed to the average shape, and intensity parameters are extracted. The shape and intensity parameters can be used together for classification.
The most recent works in the field of deformable templates are based on AAM techniques [39][40]. In this approach facial shape and appearance are modelled independently and localization is performed as a non-linear error function optimization. Although they seem ideal as a final localization step, reliable face detection using another method has to be employed first, since these iterative methods need a good initial position and size estimate to converge [41][42].
3.1.2 Appearance-based

In appearance-based methods, the face class is modeled as a cluster in a high-dimensional space where the separation from the non-face class is carried out using various classifiers. Huge training sets are required to learn the decision surface reliably. The imaging effects (scale, rotation, perspective) are removed in the upper level of the system by using a so-called "scanning window". The concept of the scanning window is the root idea of these methods. Since it is not computationally feasible to scan all possible scales and rotations, a discretization of scale and rotation has to be introduced. It is exactly this operation which makes the modeling of face appearance difficult and prone to false detections and misalignments. Images of human faces lie in a subspace of the overall image space. To represent this subspace, one can use neural approaches, but there are also several methods more closely related to standard multivariate statistical analysis which can be applied, including Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA), and Factor Analysis (FA). Since the face reconstruction by its principal components is an approximation, a residual error (distance-from-face-space, DFFS) is defined in the algorithm as a preliminary measure of "faceness", which gives a good indication of face existence [43]. A similar idea was proposed in [44] for a facial feature detector using the DFFS generated from eigenfeatures (eigeneyes, eigennose, eigenmouth) obtained from various facial feature templates in a training set. A variant of the eigenface approach uses Fisher's Linear Discriminant (FLD) to project samples from the high-dimensional image space to a lower-dimensional feature space [45]. Several methods have been proposed based on the idea that a linear face space representation can be improved using a face space represented by subclasses; many of these are based on some mixture of multidimensional Gaussians [46]. Of course, given the pattern recognition nature of the problem, many neural network approaches have also been proposed. The first neural approaches to face detection were based on Multi-Layer Perceptrons (MLPs), presenting promising results only on fairly simple datasets [47]. The first advanced neural approach which reported results on a large, difficult dataset was [48]. In [49] a probabilistic decision-based neural network was proposed; this is a classification neural network with a hierarchical modular structure. One of the more representative methods based on a naive Bayes classifier is [50][51], which uses a naive Bayes classifier to estimate the joint probability of local appearance and position of face patterns (subregions of the face) at multiple resolutions. This technique emphasizes local appearance because some local patterns of an object
are more unique than others: the intensity patterns around the eyes are much more distinctive than the patterns found around the cheeks. One of the most widely used detection methods, which has already become a standard for face detection, is the Viola-Jones approach [1][52]. It describes a face detection framework capable of processing images extremely rapidly while achieving high detection rates. Its main contributions are: i) the introduction of a new image representation, called the "Integral Image", which allows the features used by the detector to be computed very quickly; ii) a simple and efficient classifier, built using the AdaBoost learning algorithm, which selects a small number of critical visual features from a very large set of potential features; and iii) a method for combining classifiers in a "cascade", which allows background regions of the image to be quickly discarded while spending more computation on promising face-like regions. Recently many extensions to the Viola-Jones detector have been proposed, relying on different formulations of the Haar-like features, from a tilted version [53] to a joint formulation of several features [54]. Several extensions to detect faces in multiple views and with in-plane rotation have also been proposed [55][56]. Despite the excellent run-time performance of the boosted cascade classifier, the training time of such a system is rather lengthy. Numerous algorithms have been proposed to address these issues and to extend the detector to faces in multiple views. To handle the asymmetry between the positive and negative data sets, Viola and Jones proposed the asymmetric AdaBoost algorithm [57], which keeps most of the weight on the positive examples. In [52], the AdaBoost algorithm is used to select a specified number of weak classifiers with the lowest error rates for each cascade stage, and the process is repeated until a set of optimization criteria (i.e., the number of stages, the number of features of each stage, and the detection/false positive rates) is satisfied. As each weak classifier is made of one single Haar-like feature, the process within each stage can be considered a feature selection problem. Instead of repeating the feature selection process at each stage, Wu et al. [58] presented a greedy algorithm that determines the set of features for all stages before training the cascade classifier, which drastically reduces the training time. Recently, Pham and Cham proposed an online algorithm that learns asymmetric boosted classifiers [59] with a significant gain in training time. In [60], an algorithm that aims to automatically determine the number of classifiers and stages for constructing a boosted ensemble was proposed. Although the original four types of Haar-like features are sufficient to encode upright frontal face images, other types of features are essential to represent more complex patterns such as faces in different poses [55] [56]. Most systems take a divide-and-conquer strategy in which a face detector is constructed for each fixed pose, thereby covering a wide range of angles. A test image is either sent to all detectors for evaluation, or to a decision module with a coarse pose estimator that selects the appropriate trees for further processing. The ensuing problems are how the types of features are constructed, and how the most important ones are selected from a large feature space. We chose to follow the Viola-Jones approach, since its performance is compatible with a video streaming setting and the quality of its results is satisfactory.
The Viola-Jones approach permits defining a general object detection framework that can also be used for feature identification and localization, and therefore becomes useful for our automatic mask-fitting process as well.
3.2 Introduction
Every type of tracking problem obviously requires target detection before the tracking process can start. In general, target detection may require further analysis (e.g., contour detection) for better target localization; this refinement permits achieving better precision in the subsequent tracking process (e.g., drift attenuation). In this chapter we focus on face detection as the initialization of a 3D tracking process, and therefore on detecting faces inside a video stream. In our case, the percentage of faces detected in a single image has little importance in comparison with the quality of the localization. We rely on the fact that, within a reasonable period of time, a face present in the scene will eventually appear in a situation that permits it to be detected (indeed, the principal causes of missed faces are situational: posture, illumination, and expression, for example). Therefore, we put more emphasis on the accuracy of face localization (both in the sense of precise location and in terms of posture and face morphology) than on detection quality (faces detected vs. faces actually present in the image).
3.3 Face detection on video stream
Our goal is to detect probable faces inside a scene in such a way that they can be used as the starting point for the tracking procedure. We therefore need fast detection, because we focus our attention on video sequences instead of still images (e.g., real-time applications in a surveillance environment). More in detail, since we use face detection as the initialization for tracking, we put the emphasis on a low False Positive rate rather than on the True Positive rate. In [61] a face detection system for video sequences is proposed. This work uses two probabilistic detectors, one frontal and one for profiles, based on [62], and a condensation algorithm to propagate the detection probability and parameters over time. This approach includes the concept of tracking inside the probability propagation. In our case, since detection is the starting point of a more complex tracking process, we focus our research on frame-based face detection instead of video-based detection. In the literature there are several face detection algorithms; one simple classification divides them into algorithms that detect faces via feature localization (presence of eyes, mouth, etc.), called "feature-based", and algorithms that try to detect entire faces without focusing on the features, called "appearance-based". We propose two detection schemes:
• Appearance-based face detection, focused on detecting the face without preliminary macro-feature localization.
• Feature-based face detection.
Both schemes require an object detection approach: the first one to detect the face in an image, the other to validate the selected features and for the final validation. Regarding feature localization, in the case of feature-based face detection some feature locations are already discovered while many others are not, whereas in the case of appearance-based face detection we only know the face position and extension, without any reference to feature positions. Therefore both cases require feature localization. Our feature localization strategy strongly relies on the object detection approach: our aim is to detect the most important facial macro-features (eyes, eyebrows, mouth) and to locate some crucial point-like features such as the corners of the eyes and mouth. These point-like features can be easily linked to the appropriate triangle vertices of our 3D face model, building the correspondences at the base of the model-fitting approach. Since the use of object detection is slightly different for the two detection approaches and for feature detection, we first present our general object detection algorithm and then describe in detail its uses and modifications for each case.
3.3.1 Efficient object detection
Our object detection is based on the Viola and Jones [1] detector. This detector is a widely used real-time object detector (depending on the training, many kinds of objects can be detected with this approach; in our work we use it both for face detection and for feature validation). Its main features can be summarized as: i) an image representation called the "Integral Image", which allows the features used by the detector to be computed very quickly; ii) a simple and efficient classifier, built using the AdaBoost learning algorithm [63] to select a small number of critical visual features from a very large set of potential features; iii) a method for combining classifiers in a cascade, which allows background regions of the image to be quickly discarded while spending more computation on promising object-like regions. Viola and Jones use simple rectangular features reminiscent of the Haar basis functions used in [64]. In particular, three types of feature are used: i) two-rectangle features (vertical or horizontal), whose value is the difference between the sums of the pixels within two adjacent rectangular regions; ii) three-rectangle features, which compute the sum within two outside rectangles subtracted from the sum in a center rectangle; iii) four-rectangle features, which compute the difference between diagonal pairs of rectangles. The classical Viola-Jones approach uses 4 Haar-like features; in our case we use 5. Every feature has a positive and a negative evaluation region, and the characteristic of our features is that the global sign is zero. Fig. 3.1 shows these types of features. In general the zero-sum property is not a strict constraint.
In fact it is possible to use non-normalized Haar features and still obtain good detection performance.
Fig. 3.1 Haar-like features. The sum of the signs is equal to zero for every feature.
On the other hand, choosing different types of features can improve the quality of a specific detection process. Our 5 Haar-like features guarantee good generality for different types of objects to be detected; of course, the Haar features can be customized for each detection target. The Haar features are computed efficiently using the integral image approach. The integral image ii(x, y) at location (x, y) contains the sum of the pixels above and to the left of (x, y), inclusive:

ii(x, y) = \sum_{x' \le x,\; y' \le y} i(x', y')    (3.1)
where i(x, y) is the original image. Using the following pair of recurrences:

s(x, y) = s(x, y - 1) + i(x, y)    (3.2)
ii(x, y) = ii(x - 1, y) + s(x, y)    (3.3)
where s(x, y) is the cumulative row sum, s(x, -1) = 0, and ii(-1, y) = 0, the integral image can be computed in one pass over the original image. Features are extracted from sub-windows of a sample image. The base size for a sub-window is 24 by 24 pixels. Each feature type is scaled and shifted across all possible combinations, so that in a 24 by 24 pixel sub-window there are over 160,000 possible features to be calculated. Fortunately, a very small number of these features can be combined to form an effective classifier. The Viola-Jones approach uses AdaBoost both to select these features and to train the classifier. In its original form, the AdaBoost learning algorithm is used to boost the classification performance of a simple learning algorithm by combining a collection of weak classification functions into a stronger classifier. In the Viola-Jones approach each weak classifier depends on only a single feature.
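As an illustration of why this representation makes single-feature weak classifiers cheap to evaluate, the following minimal NumPy sketch computes an integral image and evaluates one two-rectangle feature on a sub-window; all function names and the toy threshold are ours rather than the chapter's.

import numpy as np

def integral_image(img):
    """Integral image ii: each entry holds the sum of the pixels above and to
    the left of it, inclusive, computed in one pass (cf. Eqs. (3.1)-(3.3))."""
    return img.astype(np.float64).cumsum(axis=0).cumsum(axis=1)

def box_sum(ii, r0, c0, r1, c1):
    """Sum of the pixels in the rectangle [r0, r1] x [c0, c1] via four lookups."""
    total = ii[r1, c1]
    if r0 > 0:
        total -= ii[r0 - 1, c1]
    if c0 > 0:
        total -= ii[r1, c0 - 1]
    if r0 > 0 and c0 > 0:
        total += ii[r0 - 1, c0 - 1]
    return total

def two_rect_feature(ii, r, c, h, w):
    """Horizontal two-rectangle Haar-like feature: left half minus right half."""
    left = box_sum(ii, r, c, r + h - 1, c + w // 2 - 1)
    right = box_sum(ii, r, c + w // 2, r + h - 1, c + w - 1)
    return left - right

def weak_classifier(f_value, theta, p):
    """h(x, f, p, theta): 1 if p*f(x) < p*theta, else 0 (cf. Eq. (3.5) below)."""
    return 1 if p * f_value < p * theta else 0

# Toy usage on a random 24x24 sub-window; theta would normally be learned.
window = np.random.randint(0, 256, (24, 24))
ii = integral_image(window)
print(weak_classifier(two_rect_feature(ii, 4, 2, 8, 12), theta=0.0, p=1))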
For each feature, the weak learner determines the optimal threshold classification function, such that the minimum number of examples is misclassified. A weak classifier h(x, f, p, θ) thus consists of a feature f, a threshold θ and a polarity p indicating the direction of the inequality:

h(x, f, p, \theta) = \begin{cases} 1 & \text{if } p\,f(x) < p\,\theta \\ 0 & \text{otherwise} \end{cases}    (3.5)

where x is a 24×24 pixel sub-window of an image. In practice no single feature can perform the classification task with low error. Consider a set of example images (x_1, y_1), ..., (x_n, y_n) where y_i = 0, 1 for negative and positive examples respectively, and initialize the weights w_{1,i} = 1/(2m), 1/(2l) for y_i = 0, 1 respectively, where m and l are the numbers of negatives and positives. The boosting loop for t = 1, ..., T includes:
• Normalize the weights w_{t,i} ← w_{t,i} / \sum_{j=1}^{n} w_{t,j}.
• Select the best weak classifier from the entire set with respect to the weighted error ε_t = min_{f,p,θ} \sum_i w_i |h(x_i, f, p, θ) - y_i|.
• Define h_t(x) = h(x, f_t, p_t, θ_t), where f_t, p_t, and θ_t are the minimizers of ε_t.
• Update the weights: w_{t+1,i} = w_{t,i} B_t^{1-e_i}, where e_i = 0 if example x_i is classified correctly, e_i = 1 otherwise, and B_t = ε_t / (1 - ε_t).
After obtaining the set of T classifiers, the final strong classifier is:

C(x) = \begin{cases} 1 & \text{if } \sum_{t=1}^{T} \alpha_t h_t(x) \ge \frac{1}{2} \sum_{t=1}^{T} \alpha_t \\ 0 & \text{otherwise} \end{cases}    (3.6)

where α_t = log(1/B_t). The initial AdaBoost threshold is designed to yield a low error rate on the training data; a lower threshold yields higher detection rates and higher false positive rates. At each round the best classifier on the current iteration is selected and combined with the set of classifiers learned so far. The final strong classifier takes the form of a perceptron, a weighted combination of weak classifiers followed by a threshold. To improve the detection rate and minimize the computation time, Viola and Jones define a cascade of classifiers (see Fig. 3.2). Simpler classifiers are used to reject the majority of sub-windows before more complex classifiers are called upon to achieve low false positive rates. Each layer of the cascade is constructed by training classifiers with AdaBoost as described. Viola and Jones start with a two-feature strong classifier, so that an object filter can be obtained by adjusting the strong-classifier threshold to minimize false negatives. The Viola-Jones framework for training the classifier allows the user to select the maximum acceptable rate for f_i (the false positive rate of the i-th classifier) and the minimum acceptable rate for d_i (the detection rate of the i-th classifier). The number of features used by each classifier is increased until the target detection and false positive rates are met for this level. If the overall target false positive rate is not yet met, then another layer is added to the cascade.
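The boosting loop above can be made concrete with a short sketch. It is a naive NumPy illustration over a precomputed feature-value matrix (exhaustive stump search, no optimizations); the function names are ours, not the chapter's.

import numpy as np

def best_stump(F, y, w):
    """Exhaustive search for the single-feature stump (feature, polarity,
    threshold) with the lowest weighted error. F is (n_samples, n_features),
    y holds labels in {0, 1}, w the current sample weights. Far too slow for
    the real 160,000-feature pool, but fine for a toy example."""
    best = (np.inf, 0, 1, 0.0)
    for j in range(F.shape[1]):
        for theta in np.unique(F[:, j]):
            for p in (+1, -1):
                h = (p * F[:, j] < p * theta).astype(int)
                err = np.sum(w * np.abs(h - y))
                if err < best[0]:
                    best = (err, j, p, theta)
    return best

def adaboost_select(F, y, T):
    """Discrete AdaBoost over single-feature stumps, following the loop above."""
    m, l = np.sum(y == 0), np.sum(y == 1)
    w = np.where(y == 0, 1.0 / (2 * m), 1.0 / (2 * l))     # initial weights
    stumps, alphas = [], []
    for _ in range(T):
        w = w / w.sum()                                     # normalize
        err, j, p, theta = best_stump(F, y, w)              # pick best stump
        beta = max(err, 1e-10) / (1.0 - err)                # B_t (guarded against err = 0)
        h = (p * F[:, j] < p * theta).astype(int)
        w = w * beta ** (1.0 - np.abs(h - y))               # reweight: correct samples shrink
        stumps.append((j, p, theta))
        alphas.append(np.log(1.0 / beta))
    return stumps, np.array(alphas)

def strong_classify(f_row, stumps, alphas):
    """Final strong classifier C(x) of Eq. (3.6): weighted vote of the stumps."""
    votes = np.array([float(p * f_row[j] < p * t) for j, p, t in stumps])
    return int(votes @ alphas >= 0.5 * alphas.sum())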
Fig. 3.2 Example of a cascade classifier.
3.3.2 Appearance-based face detection
The object detector described above is used for different validation and detection purposes. In the case of face detection, some modifications to the described object detection approach are introduced. In particular, our extensions include: i) a preliminary sub-window analysis, and ii) an ad hoc definition of the initial cascade stages. The integral image approach defined by Viola and Jones makes it possible to compute the integral of a sub-window very efficiently while the face search is ongoing. During the search, sub-windows are extracted by scaling and translating a window inside the frame area; the scanning process is therefore controlled by two types of parameter, translation and scale. In our previous work [65] we studied the influence of these two parameters on the detection results, in relation with the computational time. The integral image is of course also used to compute the Haar feature values inside a selected sub-window. Before starting the first stage of the detection cascade on a sub-window, a simple condition related to the variance is evaluated: it is widely accepted that face regions are characterized by high variance, while many non-face areas are characterized by low variance. This permits discarding a sub-window without performing the entire stage evaluation. In general the result of a scanning process is a set of overlapping probable face locations, where the amount of overlap depends on the granularity of scale and translation; the final face location is obtained by evaluating the overlapping regions. Regarding the cascade creation process, we force each of the first three stages to be based on a single weak classifier. This is equivalent to defining an ad hoc cascade filter for the first stages; the same idea has recently been exploited by [58], mainly to reduce the training time and to decouple the feature selection process from the cascade design. In our case the weak classifiers of the initial stages are trained to obtain the maximum number of true positives, paid for with a high false positive rate (Fig. 3.3 shows a portion of the positive face examples).
Fig. 3.3 An example of the face training set gathered from the Internet.
This strategy permits discarding a great number of non-face sub-windows, demanding the refinement to the following cascade stages. The approach can be reiterated for several stages, depending on the training set and on the tuning decisions adopted. Fig. 3.4 shows an example of our first three weak-classifier features applied to a face sub-window.
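The variance pre-check mentioned above can be implemented with the same integral-image machinery. The sketch below is our illustration (reusing integral_image and box_sum from the earlier sketch); the threshold value is an arbitrary placeholder, not a figure from the chapter.

import numpy as np

def window_variance(ii, ii_sq, r, c, size):
    """Variance of a size x size sub-window from the integral image of the
    frame (ii) and of its squared pixels (ii_sq): E[x^2] - E[x]^2."""
    n = float(size * size)
    s = box_sum(ii, r, c, r + size - 1, c + size - 1)
    sq = box_sum(ii_sq, r, c, r + size - 1, c + size - 1)
    return sq / n - (s / n) ** 2

def passes_variance_check(ii, ii_sq, r, c, size, min_var=100.0):
    """Reject flat (low-variance) sub-windows before any cascade stage runs.
    min_var is an arbitrary placeholder threshold, not a value from the chapter."""
    return window_variance(ii, ii_sq, r, c, size) >= min_var

# The two integral images are built once per frame, e.g.:
#   ii = integral_image(frame)
#   ii_sq = integral_image(frame.astype(np.float64) ** 2)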
Fig. 3.4 Examples of the first three one-weak-classifier cascade stages. From top left to bottom left, the three Haar features used by the classifiers (black positive, yellow negative); the bottom-right figure shows the patch under analysis.
Our training is based on near-frontal faces, and the detection cascade therefore works only under this condition. In the literature there are different works that try to introduce some posture invariance (especially with respect to tilting) inside Haar-based classifiers [53]. In our work we chose instead to tilt the frame and re-run the search process, which is equivalent to extending the scanning approach with a tilt parameter [66]. Of course this slightly complicates the problem of overlapping detection areas. Fig. 3.5 shows an example of multiple detections of one face caused by the tilt parameter; the final detection is simply obtained by evaluating the overlapping areas.
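A minimal sketch of this tilt-and-rescan strategy follows, using OpenCV for the in-plane rotation. The `detect` callable is a stand-in for the chapter's detector (any function returning bounding boxes can be plugged in), and the angle set is illustrative, not a value from the chapter.

import cv2

def detect_with_tilt(frame, detect, angles=(-20, -10, 0, 10, 20)):
    """Run a frontal detector on in-plane rotated copies of the frame and
    return the detections tagged with their tilt angle."""
    h, w = frame.shape[:2]
    center = (w / 2.0, h / 2.0)
    hits = []
    for angle in angles:
        M = cv2.getRotationMatrix2D(center, angle, 1.0)   # 2x3 rotation matrix
        rotated = cv2.warpAffine(frame, M, (w, h))
        for (x, y, bw, bh) in detect(rotated):
            hits.append((angle, x, y, bw, bh))
    # Overlapping hits coming from different angles are then merged, e.g. by
    # mapping the box centres back through the inverse rotation and grouping them.
    return hits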
Fig. 3.5 Multiple detections of one face caused by tilting, on the BioID database [68]. Each detection rectangle defines the detection area under a different tilt.
3.3.3 Features-based face detection
Our feature-based detection algorithm relies on the following steps:
• Adaptive skin detection: aimed at detecting the blobs that include faces. Our adaptation strategy for the skin model permits covering most changes in illumination (under a white light source).
• Eyes detection and validation: detects eye pairs, one of the most meaningful facial features, inside the skin blobs and validates them using an object detection approach trained on the eye-pair region.
• Face normalization: performs face localization and normalization.
The main idea of this multi-step algorithm is to obtain a coarse-to-fine classification: the classification moves from rough to precise face detection, with correspondingly growing levels of reliability of correct classification. Color is used only in the first step (adaptive skin detection); the other steps work with grayscale images. In the following we present a detailed description of every stage.
3.3.3.1 Adaptive Skin detection
Human skin color has been used and proven to be an effective feature for face detection. Color is highly robust to geometric variations of the face pattern and allows fast processing. In addition, human skin has a characteristic color which is easily recognized by humans, so employing skin color modelling for face detection was an idea suggested both by the task properties and by common sense. On the other hand, skin color is an efficient tool for identifying facial areas only if the model can be properly adapted to different lighting environments. Many skin color models are not effective where the spectrum of the light source varies significantly; in other words, color appearance is often unstable due to changes in both background and foreground lighting. The scope of "adaptive skin detection" is mainly the reduction of the search area. This approach does not aim at a precise detection of face contours; it only attempts to detect areas that may include faces. Our skin detection strategy belongs to the "parametric methods" and works with a single Gaussian in the YCbCr color space (we do not use a mixture of Gaussians because we do not need great precision, and we need to save computational time). In particular, the Gaussian joint probability density function (pdf) is defined as:
p(c|skin) = \frac{1}{2\pi\,|\sigma_s|^{1/2}} \; e^{-\frac{1}{2}(c - \mu_s)^T \sigma_s^{-1} (c - \mu_s)}    (3.7)
where c is a color vector and µ_s and σ_s are the mean vector and covariance matrix respectively:

\mu_s = \frac{1}{n} \sum_{j=1}^{n} c_j, \qquad \sigma_s = \frac{1}{n-1} \sum_{j=1}^{n} (c_j - \mu_s)(c_j - \mu_s)^T    (3.8)

where n is the total number of skin color samples c_j. The probability p(c|skin) is a measure of how skin-like the color c is. For every color c of a pixel x of the image we obtain the skin probability (or rather how well it belongs to the skin region defined by the Gaussian); using a threshold on this probability we obtain the skin blobs. This type of skin map often suffers from illumination problems and from changes in skin color due to ethnic differences. To reduce these types of disturbance we take inspiration from the "dynamic skin model" approach. Regarding the ethnicity-related issue, we train our Gaussian using skin samples from different ethnic groups. Of course this approach increases the probability of including more non-skin colors inside the Gaussian area. Our policy, on the other hand, is to maximize the probability of defining a skin region that includes facial skin, not to minimize the non-skin colors that fall inside a skin region: we accept high false positive areas in order to maximize the true positives. Regarding the illumination issue, considering only white light and the digital camera's white compensation, we compute what we call an "adaptive skin map". To perform this adaptation, we arrange the adjustment of the skin mean value as follows: i) compute the skin map, ii) dilate the skin-map region, iii) recompute the Gaussian's parameters considering also the colors of the pixels under the new enlarged region.
Considering µ_enl and σ_enl as the mean and covariance of the pixels under the enlarged region, the new µ_new and σ_new are:

\sigma_{new}^{-1} = \sigma_s^{-1} + \sigma_{enl}^{-1}    (3.9)
\mu_{new} = \sigma_{new} \left( \sigma_s^{-1} \mu_s + \sigma_{enl}^{-1} \mu_{enl} \right)    (3.10)
This process may enlarge the skin region so that it includes some non-face areas. This is not a problem, since our goal is not to detect the exact face region but only to reduce the face-search area with respect to the entire frame. The main advantage is that in many cases this adaptation compensates for a light change (Fig. 3.6). In our experience, even though this skin colour adjustment enlarges the blob areas, the eye regions (even if only by a few pixels) still remain outside the skin-map region. We then automatically perform some simple morphological operations to obtain a more precise separation of skin and non-skin regions inside a skin blob (Fig. 3.6).
Fig. 3.6 Skin map process for blob detection: (a) initial images; (b) normal skin map; (c) adapted skin map; (d) skin blobs after several morphological operations.
To conclude, the skin map reduces the frame to a few sub-areas that may contain a face.
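A compact sketch of this stage, following Eqs. (3.7)–(3.10), is given below. It is our NumPy illustration for a two-dimensional colour vector (e.g. CbCr), matching the normalization constant written in Eq. (3.7); the thresholding, dilation and colour-space handling are only outlined in the comments, and the function names are ours.

import numpy as np

def skin_probability(pixels, mu, cov):
    """Gaussian skin likelihood p(c|skin) of Eq. (3.7) for an (N, 2) array of
    colour vectors (e.g. the CbCr components of each pixel)."""
    inv = np.linalg.inv(cov)
    norm = 1.0 / (2.0 * np.pi * np.sqrt(np.linalg.det(cov)))
    diff = pixels - mu
    expo = -0.5 * np.einsum('ij,jk,ik->i', diff, inv, diff)
    return norm * np.exp(expo)

def adapt_skin_model(mu_s, cov_s, enlarged_pixels):
    """Update of Eqs. (3.9)-(3.10): fuse the trained Gaussian (mu_s, cov_s)
    with the statistics of the colours under the dilated skin region."""
    mu_e = enlarged_pixels.mean(axis=0)
    cov_e = np.cov(enlarged_pixels, rowvar=False)
    cov_new = np.linalg.inv(np.linalg.inv(cov_s) + np.linalg.inv(cov_e))
    mu_new = cov_new @ (np.linalg.inv(cov_s) @ mu_s + np.linalg.inv(cov_e) @ mu_e)
    return mu_new, cov_new

# Usage outline: threshold skin_probability(...) to get the skin map, dilate
# the map, gather the colours under the enlarged region, call adapt_skin_model,
# and recompute the map with the updated parameters.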
3.3.3.2 Eyes detection and validation
After the rough preliminary facial area localization, we obtain the image portions where a possible face can be located. These very rough localizations do not depend on the face's posture; at this stage, therefore, we do not restrict our face detection strategy to frontal faces only, in accordance with our coarse-to-fine and progressively selective strategy. This cannot be obtained with the appearance-based face detection approach, where a complete scan of the image must be carried out and only faces with a posture in accordance with the training set can be detected. Our assumption is that if there are two eyes inside a skin region, that skin region may contain a face. To find the eyes inside the skin region we use the non-skin areas included in every skin blob; each non-skin area is a possible eye location. The main idea is to detect a pair of non-skin regions that can possibly be the eyes of a face. This search is performed so as to reduce the number of possible pair candidates, with heuristics based on the relative blob positions. Summarizing, the eye search process is performed as follows: i) compute every probable pair of non-skin regions; ii) compute the centroid of every non-skin region in each pair; iii) compute the affine transformation that normalizes the box containing the eyes; iv) verify the normalized region with the object detection approach (trained with eye-pair images). In particular, we compute the centroid of every non-skin blob inside a skin blob region (red dots in Fig. 3.7). Every possible pair of centroids c_i, c_j, where c_i = [x, y] is the coordinate vector of the i-th centroid, is taken into account as a probable pair of eyes if ||c_i - c_j||^2 < thre. The threshold thre defines the minimum detectable face dimension and is determined experimentally.
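A minimal sketch of this pairing step is shown below; it uses SciPy for the blob labelling, the squared Euclidean distance reflects our reading of the criterion above, and the function names are ours.

import numpy as np
from scipy import ndimage

def eye_pair_candidates(non_skin_mask, thre):
    """Pair the centroids of the non-skin blobs found inside a skin region;
    only pairs whose squared centroid distance is below `thre` are kept as
    probable eye pairs."""
    labels, n = ndimage.label(non_skin_mask)
    cents = ndimage.center_of_mass(non_skin_mask, labels, range(1, n + 1))
    cents = [np.array(c) for c in cents]          # (row, col) coordinates
    pairs = []
    for i in range(len(cents)):
        for j in range(i + 1, len(cents)):
            if np.sum((cents[i] - cents[j]) ** 2) < thre:
                pairs.append((cents[i], cents[j]))
    return pairs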
Fig. 3.7 Eye localization via non-skin blobs: (a) initial images; (b) adaptive skin blob; (c) non-skin blobs with centroids (in red, the probable eye centroids).
This thresholding approach reduces the number of pair candidates. Our final aim is to validate the eye pairs with the object detection approach; we therefore need to obtain eye-pair patches to be classified, and these patches need to be normalized for classification purposes. We adopt an affine transformation to map the actual eye region to a normalized one (Fig. 3.9 shows some normalizations). An affine transformation is a geometrical transformation which preserves the parallelism of lines but not lengths and angles. Usually this transformation is represented in homogeneous coordinates (i.e., (x, y, 1)), because the transformation of a point by any affine
transformation can be expressed as the multiplication of a 3×3 matrix A by a 3×1 point vector:

\begin{pmatrix} x' \\ y' \\ 1 \end{pmatrix} = \begin{pmatrix} S\cos(\theta) - 1 & -a\sin(\theta) & t_x \\ a\sin(\theta) & S\cos(\theta) - 1 & t_y \\ 0 & 0 & 1 \end{pmatrix} \cdot \begin{pmatrix} x \\ y \\ 1 \end{pmatrix}    (3.11)

where S is the scale factor, a is the shearing, θ represents the rotation angle, and t_x and t_y are the translation components. Fig. 3.8 shows the relation between the normalized region and a pair of eye candidates. To obtain the affine transformation in (3.11), considering a pair of centroids c_1 = [x_1, y_1], c_2 = [x_2, y_2] and following the definition presented in Fig. 3.8:
Fig. 3.8 Eye-pair normalization. On the left, the reference eye model; on the right, a pair of centroids taken into account for normalization.
\begin{aligned}
d_x &= x_2 - x_1 \\
d_y &= y_2 - y_1 \\
a_1 &= d_x / d_{ref} \\
a_2 &= a_1 \, d_y / d_x \\
a_3 &= x_1 + a_2\, ys_{ref} - xs_{ref}\,(a_1 - 1) \\
a_4 &= y_1 - a_2\, xs_{ref} - ys_{ref}\,(a_1 - 1)
\end{aligned}
\qquad
A = \begin{pmatrix} a_1 - 1 & -a_2 & a_3 \\ a_2 & a_1 - 1 & a_4 \\ 0 & 0 & 1 \end{pmatrix}    (3.12)
Applying the affine transformation A we obtain the normalized eye-pair patch. These eye-region candidates then need to be verified. Fig. 3.9 shows some probable eye regions after the affine normalization and the corresponding faces detected on the AR database. We built a large training set with positive and negative examples; the training set is composed of eye-pair images labelled manually. Generally speaking, any kind of machine-learning approach could be used to learn a classification function (for instance neural networks or support vector machines). We chose to follow the approach described for object detection, where efficiently computed Haar features associated with each eye image are used. The dimension of the normalized eye candidate region is the same as that of the training patches, so that we avoid the scanning problem. The cascade classifier is composed of several stages.
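As a concrete (if generic) version of this normalization step, the sketch below computes a similarity transform that maps the two detected centroids onto fixed reference eye positions and warps the patch with OpenCV. It is not the chapter's exact parameterization of Eq. (3.12); the reference positions and output size are illustrative, and the function names are ours.

import numpy as np
import cv2

def eye_normalization_matrix(c1, c2, r1=(8.0, 12.0), r2=(32.0, 12.0)):
    """Similarity transform (rotation + uniform scale + translation) mapping
    the detected eye centroids c1, c2 (given as (x, y)) onto fixed reference
    positions r1, r2 inside the normalized patch."""
    c1c, c2c = complex(*c1), complex(*c2)
    r1c, r2c = complex(*r1), complex(*r2)
    z = (r2c - r1c) / (c2c - c1c)      # scale and rotation as one complex number
    t = r1c - z * c1c                  # translation
    return np.array([[z.real, -z.imag, t.real],
                     [z.imag,  z.real, t.imag]], dtype=np.float64)

def normalize_eye_patch(gray, c1, c2, out_size=(40, 24)):
    """Warp the frame so that the eye pair lands on the reference positions."""
    M = eye_normalization_matrix(c1, c2)
    return cv2.warpAffine(gray, M, out_size)   # out_size is (width, height)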
Fig. 3.9 Some examples of probable eye pairs after normalization, and the corresponding faces after eye-pair classification, on the AR database [67].
Fig. 3.10 Examples of Haar features applied to eye pair classification.
Each stage, called a strong classifier, is trained by the AdaBoost algorithm with the following rules: reject 60% of the negative items and ensure that at least 99.8% of the eye pairs are detected. Our first two stages are fixed to use only 1 Haar-like feature each; for the following stages, the complexity of detection and the number of features needed increase. Fig. 3.10 shows the first five Haar features used for verification. Using the cascade makes it possible to quickly drop many non-eye pairs in the first stages while losing very few eye pairs. At this step we do not need to reach great reliability in the classification, so we decided to use only a three-stage boosting cascade to speed up every operation: we accept some false positives but maximize the true positive detections. In our experiments this classifier discards 89% of the false eye pairs with no significant percentage of correct eye pairs discarded. Fig. 3.11 shows some detected faces in an image database built from famous people found on the Internet.
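One simple way to implement the 99.8%-detection rule when setting each stage's threshold is sketched below. This is our illustration of the idea on a validation set, not necessarily how the stages were actually tuned, and the function and variable names are ours.

import numpy as np

def tune_stage_threshold(scores_pos, scores_neg, min_tpr=0.998):
    """Choose a stage threshold as the highest value that still accepts at
    least `min_tpr` of the positive (eye-pair) scores, then report the
    fraction of negatives the stage rejects. Scores are the weighted stump
    sums of Eq. (3.6) computed on a validation set."""
    scores_pos = np.sort(np.asarray(scores_pos))
    k = int(np.floor((1.0 - min_tpr) * len(scores_pos)))   # positives we may lose
    threshold = scores_pos[k]
    tpr = float(np.mean(scores_pos >= threshold))
    rejection = float(np.mean(np.asarray(scores_neg) < threshold))
    return threshold, tpr, rejection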
Fig. 3.11 An example of a face detected with a posture and even an expression different from the frontal neutral view. Note that this face is already normalized in terms of tilting.
3.3.3.3 Face normalization
For every pair of eyes we reconstruct a face using human face morphology ratios. We use a normalization strategy similar to the one adopted for eye localization (i.e., an affine transformation). In this manner we obtain the face areas used for facial feature localization. The probability of having a face inside the location defined by the normalization is experimentally very high; nevertheless, it is reasonable to perform a validation using an appearance-based, detection-like classifier, as in the case of the eye pairs. The goal of this verification is to increase the reliability of the face detection system. The degree of reliability is estimated by observing at which stage a candidate face is rejected: the more stages the candidate passes, the greater the reliability degree. Since we are dealing with a sub-window that probably contains a face, the validation cascade is trained in such a way that the first stage is already quite robust. On the other hand, in several cases the eyes are visible while the mouth is occluded; in this situation the validation generally fails (if the training set does not include mouth-occluded examples) even though the face localization correctly detects the face. In this sense the feature-based localization is more robust than the object-based one. The validation is optional, since in some cases it can lead to wrong results due to a lack of training examples. In Table 3.1 we present the detection results on the AR database [67], which is focused on strong occlusions. As can be noticed, correct detection results for images with black sunglasses are not present at all, and results for images exposed to a yellow illuminant are really poor. This is because we worked under the hypothesis of a white illuminant and visible eyes. We performed this test, even though our approach is not focused on solving this particular type of problem, mainly for the sake of completeness. Summing up, in order to keep a false positive rate of 0.28% we paid with a true positive rate of 72.9%, without considering the black sunglasses and yellow illuminant images.
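To make the reconstruction step concrete, a small sketch based on simple interocular-distance ratios follows; the specific ratio values are illustrative placeholders, since the chapter does not report the exact morphology ratios it uses.

def face_box_from_eyes(c1, c2, width_factor=2.0, up_factor=0.6, down_factor=1.6):
    """Reconstruct a face bounding box from the two eye centres (x, y) using
    simple interocular-distance ratios. The three factors are illustrative
    placeholders, not values taken from the chapter."""
    (x1, y1), (x2, y2) = c1, c2
    d = ((x2 - x1) ** 2 + (y2 - y1) ** 2) ** 0.5       # interocular distance
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0          # eye-pair midpoint
    half_w = width_factor * d / 2.0
    return (cx - half_w, cy - up_factor * d,            # x_min, y_min
            cx + half_w, cy + down_factor * d)          # x_max, y_max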
3.4 Experimental results
In this section we present an experimental evaluation of the face detection and feature localization algorithms discussed in this chapter. We have conducted several tests to evaluate our face detection approach on different widely used databases from the literature. In particular we use:
1. BioID [68], recorded with special emphasis on "real world" conditions: the test set features a large variety of illumination conditions, backgrounds and face sizes. The dataset consists of 1521 gray-level images with a resolution of 384×286 pixels; each one shows the frontal view of the face of one out of 23 different test persons. BioID is used for feature localization since it is already labelled with the eye positions and other feature positions.
2. The MMI Facial Expression database [69] holds over 2000 videos and over 500 images of about 50 subjects displaying various facial expressions on command.
PhotoIndex  N. CorrectDetection  N. Images  Characteristic
1           99                   135
2           113                  135        Smile
3           102                  135
4           104                  135        OpenMouth, CloseEyes
5           24                   135        YellowLightRight
6           14                   135        YellowLightLeft
7           4                    135        YellowLight
11          70                   135        Scarf
12          11                   135        Scarf, YellowLightRight
13          4                    135        Scarf, YellowLightLeft
14          92                   119
15          102                  120        Smile
16          93                   120
17          93                   120        OpenMouth, CloseEyes
18          21                   120        YellowLightRight
19          14                   120        YellowLightLeft
20          1                    120        YellowLight
24          61                   120        Scarf
25          11                   120        Scarf, YellowLightRight
26          9                    120        Scarf, YellowLightLeft
Table 3.1 AR database correct recognition results.
The database is provided with FACS labelling for expression recognition, as in the case of the Cohn-Kanade database.
3. The Cohn-Kanade database [70] consists of approximately 500 image sequences from 100 subjects; the image sequences are taken with a video camera. Subjects range in age from 18 to 30 years; sixty-five percent were female, 15 percent were African-American and three percent Asian or Latino. Many videos have an incorrect white compensation. The database is provided with FACS labelling for expression recognition; in particular, subjects were instructed by an experimenter to perform a series of facial displays that included single action units (e.g., AU 12, or lip corners pulled obliquely) and action unit combinations (e.g., AU 1+2, or inner and outer brows raised). Each sequence begins from a neutral or nearly neutral face.
4. The Hammal-Caplier database [71] holds about 50 short image sequences from 16 subjects, in which complex ecological expressions are presented. The main expressions presented are disgust, joy and fear.
5. The CMU-MIT database [48] consists of 130 images with 507 ground-truth labeled faces. It is one of the "de facto" standards for the comparison of face detection approaches.
We propose two face detection strategies, one usable on color images and the other on both color and grayscale images. In both cases, the goal is not to obtain the best Receiver Operating Characteristic (ROC) [72] curve on a specific face database such as the standard FERET database [73][74] or the MIT-CMU database [48] (the de facto standard for comparison): since this detection only serves to make the 3D face model tracking automatic, our aim is to produce face detections in real time with good precision and the lowest possible False Positive rate.
In this sense we evaluate our strategy on every database described above and report the results in terms of True Positives (TP) and False Positives (FP); these indices represent the percentage of correctly classified faces and the number of non-faces wrongly classified as faces, respectively. Another characteristic that differentiates our evaluation from the usual experimental sections on face detection is the number of people present in the scene, which is mainly one or two: we are in fact more focused on human-machine applications of the tracking algorithm. Nevertheless, we also performed some tests with more than one face in the scene, using our own laboratory database. We consider both detection strategies (Appearance-based Face Detection (AFD) and Feature-based Face Detection (FFD)): for color databases we present the results of the feature-based face detection, otherwise the results refer to the appearance-based one. Furthermore, in the case of video databases, we count the true positives and false positives over the first 25 frames; if no TP can be detected in the first 25 frames, that face is considered undetectable. This is quite restrictive for a real environment at 25 fps, but useful for a TP vs. FP analysis. More in detail, the results for the BioID and Cohn-Kanade databases refer to AFD (grayscale databases), while the results for the MMI and Hammal-Caplier databases refer to FFD (color databases).
DB              TP rate  FP number
BioID           95.5%    43
MMI             99%      4
Cohn-Kanade     98%      12
Hammal-Caplier  100%     1
Table 3.2 Face detection and facial features location results. True Positive (TP) rate, False Positive (FP) number.
The results on the Hammal-Caplier database strongly depend on the high quality and the low number of experiments taken into account; a similar remark about quality is valid for the MMI database. Since the BioID database seems the most complex, we focus our attention on the experiments related to it. In [75] the BioID database is used for testing face detection using the Hausdorff distance as a similarity measure between a general face model and possible instances of the object within the image. This is a two-step process that allows both coarse detection and exact localization of faces (more in detail, coarse detection with a face model and refinement of the initially estimated position with an eye model). They consider a face correctly localized if the eye positions do not differ from the ground truth by more than 0.25 of the interocular distance, obtaining a 91.8% detection rate. In [76] the same approach is improved with genetic algorithms, obtaining 92.8%. In [77] an eye-oriented detection achieves a 94.5% detection rate, while in [78] 95.67% is achieved with 64 False Positives. Another interesting technique, developed in [79], achieves 97.75% with only 25 FP, while more recently, and more similarly to our approach, [80] proposed an approach based on
asymmetric Haar features that achieves a 93.68% TP rate with 432 FP. For the sake of conciseness we report this comparison in Table 3.3.

Methods  TP rate  FP number
[75]     91.80%   Not reported
[76]     92.80%   Not reported
[77]     94.5%    Not reported
[78]     95.67%   64
[79]     97.75%   25
[80]     93.68%   432
our      95.5%    43
Table 3.3 Comparison of different face detection methods on BioID database
Considering our goal, these results, even if not the best in the literature for static image databases, perfectly fit our requirements for detection in video. For the sake of completeness we also include a literature comparison of our face detection system with some other well-known detection approaches. Our system is, of course, not trained to obtain the best results in such a noisy environment, since our goal is detection in streaming video; nevertheless, the comparison is still interesting. The database used for testing is the CMU-MIT database [48], which consists of 130 images with 507 ground-truth labeled faces. Faces are presented with different postures and sizes; in general several faces are present in each image (see Fig. 3.12), together with some drawn faces which in our case we do not consider detectable. The quality of the images is generally poor and sometimes degraded, since they were printed on low-quality paper (some images were obtained by a scanning process).
Fig. 3.12 Examples of face detection on the CMU-MIT database. The left image contains faces at different scales and with slightly different postures, the middle image contains two faces in a complex brochure, and the right one contains some drawn faces and a real one.
Table 3.4 shows the literature comparison of detection results on the CMU-MIT database. Even though it was not built to obtain great results in such a scenario, our method achieves good results.
Methods                                        TP rate  FP number
Sung and Poggio (on subset of database) [46]   81.9%    13
Rowley et al. [48]                             90.1%    167
Roth et al.* [81]                              94.8%    78
Yang et al. [82]                               86.0%    8
Viola-Jones [52]                               93.90%   167
Schneiderman and Kanade [62]                   93%      88
Schneiderman and Kanade (wavelet)* [50]        91.8%    110
Schneiderman and Kanade (eigenvectors)* [51]   95.8%    65
Meynet et al. [83]*                            92.8%    88
Ebrahimpour et al. [84]                        98.8%    10
our*                                           94.5%    73
Table 3.4 Comparison of different face detection methods on CMU-MIT. * indicates that 5 images of line-drawn faces were excluded, leaving 125 images with 483 labeled faces.
3.4.1 Discussion and Conclusions
The detection approaches presented guarantee the level of performance required for face detection as the initialization step of video-stream tracking. In this section we underline, as a comparison, the main drawbacks and characteristics of the two face detectors presented. The feature-based face detection system is suitable for color video streams, where the skin-map definition is applicable. The skin-map approach brings two advantages: i) it reduces the search area, and ii) it guarantees a certain posture independence (especially on the tilt side) in the eye search process, thanks to the non-skin blobs and the validation approach. Generally speaking, if for some reason (e.g., ambient light) the skin-map process does not produce a reliable skin-map region, the result can be a too wide skin region or a too small one. In both cases the discriminant is given by the non-skin regions belonging to the skin region: the probability of classifying eye areas as skin is very low, so the only problem arises when the skin-map region is too small to contain the eye region, or absent. In this case, and when we deal with grayscale input, the skin map is unusable, and the feature-based approach cannot be applied with success. One could start the eye validation process on the entire frame, but this would become more similar to appearance-based face detection than to feature-based detection. Therefore, when no skin map is available (or rather, in the case of grayscale images), appearance-based detection is preferable. Another interesting advantage of the feature-based approach is its greater robustness to expression changes: the eye region remains quite stable even when changes in expression occur. Regarding the appearance-based approach, the search process is performed over the entire frame, without any preliminary restriction. Its results are more dependent on posture and expression, since populating the training set with many face images spanning too wide a range of postures increases the probability of false positive detections. The same discussion holds for partial occlusion: the feature-based approach is obviously robust to partial occlusion of unused features such as the mouth, while it has no robustness in the case of eye occlusion (sunglasses, etc.). The appearance-based
detection, thanks to the cascade approach, can be tuned to obtain better robustness without impacting the false positive rate, by extending the cascade concept with different ad hoc trainings, for instance an ad hoc training for each type of occlusion, following the proposal in [56] for posture. This occlusion-oriented improvement also affects the feature-based approach, since the feature-based detection includes a validation similar to the object detection strategy. On the other hand, the validation requires an initial localization; therefore, if the occlusions invalidate the localization, the aforesaid improvement is completely unusable. Considering computational efficiency, the two approaches are comparable, even if the appearance-based one requires a complete frame scan. Concluding, even though the two approaches are computationally similar, the feature-oriented detection is preferable for color video streams, while the appearance-based detection is preferable for grayscale ones. The feature-based approach guarantees more efficacy under changes in expression or posture; but since both are in general well temporally-bounded deformations of the rest facial appearance, the appearance-based detector generally only incurs a delay before detecting the face in a scene.
References 1. Viola, P. and Jones, M.: Robust real-time object detection. Proc. of Second Int. Workshop on Stat. and Comp. Theories of Vision, (2001) 2. Wu, H., Chen, Q., Yachida, M.: Face detection from color images using a fuzzy pattern matching method. IEEE Trans. Pattern Analysis and Machine Intelligence, 21(6), 557–563 (1999) 3. Reisfeld, D., Yeshurun,Y.: Preprocessing of face images: Detection of features and pose normalization. Comput. Vision Image Understanding, 71 (1998) 4. Craw, I., Ellis, H., Lishman, J.: Automatic extraction of face features. Pattern Recognition Letters (1987) 5. Govindaraju, V.: Locating human faces in photographs. Int. J. Comput. Vision, 19 (1996) 6. Herpers, R., Michaelis, M., Lichtenauer, K.-H., Sommer, G.: Edge and keypoint detection in facial regions. Proc. of 2nd IEEE Int. Conf. on Automatic Face and Gesture Recognition (1996) 7. Venkatraman, M., Govindaraju, V.: Zero crossings of a non-orthogonal wavelet transform for object location. IEEE Intl Conf. Image Processing, (1995) 8. Graf, H.P., Cosatto, E., Gibbon, D., Kocheisen, M., Petajan, E.: Multimodal system for locating heads and faces. Proc. Second Intl Conf. Automatic Face and Gesture Recognition, (1996) 9. Yang, G., Huang, T.S.: Human face detection in a complex background. Pattern Recognition, 27 (1994) 10. Kotropoulos, C., Pitas, I.: Rule-based face detection in frontal views. Proc. of Intl Conf. Acoustics, Speech and Signal Processing, (1997) 11. Phung, S.L., Bouzerdoum, A., Chai, D.: Skin segmentation using color pixel classification: Analysis and comparison. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(1) (2005) 12. Kruppa, H., Bauer, M.A., Schiele,B.: Skin patch detection in real-world images. Proc. of Annual Symposium for Pattern Recognition of the (DAGM2002), (2002) 13. Chang, F., Ma, Z., Tian, W.: A region-based skin color detection algorithm. Proc. of PacificAsia Conference on Knowledge Discovery and Data Mining, (2007) 14. Satoh, S., Nakamura, Y., Kanade, T.: Name-it: Naming and detecting faces in news videos. IEEE Multimedia, (1999)
15. Brown, D., Craw, I., Lewthwaite, J.: A som based approach to skin detection with application in real time systems. Proc. of the British Machine Vision Conference, (2001) 16. Hsu, R.-L., Abdel-Mottaleb, M., Jain, A.K.: Face detection in color images. IEEE Trans. Pattern Analysis and Machine Intelligence, (2002) 17. Saber, E., Tekalp, A.: Frontal-view face detection and facial feature extraction using color, shape and symmetry based cost functions. Pattern Recognition Letters, (1998) 18. Marques, F., Vilaplana, V.: A morphological approach for segmentation and tracking of human faces. Proc. of International Conference on Pattern Recognition (ICPR00), (2000) 19. Brand, J., Mason, J.: A comparative assessment of three approaches to pixel level human skindetection. Proc. of the International Conference on Pattern Recognition, (2000) 20. Wang, C., Brandstein, M.: Multi-source face tracking with audio and visual data. Proc. of IEEE International Workshop on Multimedia Signal Processing (MMSP), (1999) 21. Fukamachi, H., Terrillon, J.-C., Shirazi, M.N., Akamatsu, S.: Comparative performance of different skin chrominance models and chrominance spaces for the automatic detection of human faces in color images. Proc. of the International Conference on Face and Gesture Recognition, (2000) 22. Kovac, J., Peer, P., Solina, F.: Human skin colour clustering for face detection. Proc. of EUROCON 2003 - International Conference on Computer as a Tool, (2003) 23. Jones, M.J., Rehg, J.M.: Statistical color models with application to skin detection. Proc. of IEEE Int. conf. on Computer Vision and Pattern Recognition, (1999) 24. Phung, S.L., Bouzerdoum, A., Chai, D.: A novel skin color model in ycbcr color space and its application to human face detection. Proc. of IEEE International Conference on Image Processing (ICIP’2002), (2002) 25. De Silva, L.C., Aizawa, K., Hatori, M.: Detection and tracking of facial features by using a facial feature model and deformable circular template. IEICE Trans. Inform. Systems, (1995) 26. Jeng, S.H., Liao, H.Y.M., Han, C.C., Chern, M.Y., Liu, Y.T.: Facial feature detection using geometrical face model: An efficient approach. Pattern Recognition 31 (19989 27. Huang, W., Mariani, R.: Face detection and precise eyes location. Proc. of 15th International Conference on Pattern Recognition, (2000) 28. Hamouz, M., Kittler, J., Kamarainen, J.-K., K¨alvi¨ainen, H., Paalanen, P.: Affine-invariant face detection and localization using gmm-based feature detector and enhanced appearance model. Proc. of Sixth IEEE International Conference on Automatic Face and Gesture Recognition, (2004) 29. Loy, G., Eklundh, J.-O.: Detecting symmetry and symmetric constellations of features. Proc of European Conference on Computer Vision (ECCV2006), (2006) 30. Maio, D., Maltoni, D.: Real-time face location on gray-scale static images. Pattern Recognition 33 (2000) 31. Craw, I., Tock, D., Bennett, A.: Finding face features. Proc. of Second European Conf. Computer Vision, (1992) 32. Samal, A., Iyengar, P.A.: Human face detection using silhouettes. Intl J. Pattern Recognition and Artificial Intelligence, (1995) 33. Wu, B., Nevatia, R.: Detection of multiple, partially occluded humans in a single image by bayesian combination of edgelet part detectors. Proc. of Tenth IEEE International Conference on Computer Vision, (2005) 34. Miao, J., Yin, B., Wang, K., Shen, L., Chen, X.: A hierarchical multiscale and multiangle system for human face detection in a complex background using gravitycenter template. 
Pattern Recognition 32 (1999) 35. Karlekar, J., Desai, U.B.: Finding faces in color images using wavelet transform. Proc. of the 10th International Conference on Image Analysis and Processing, (1999) 36. Lanitis, A., Taylor, C.J., Cootes, T.F.: Automatic interpretation and coding of face images using flexible models. IEEE Pattern Analysis and Machine Intelligence, 19 (1997) 37. Cootes, T.F., Taylor, C.J.: Locating faces using statistical feature detectors. Proc. of Second Intl Conf. Automatic Face and Gesture Recognition, (1996) 38. Cootes, T., Cooper, D., Taylor, C., Graham, J.: Active shape models their training and application. Computer Vision and Image Understanding, 61(1) (1995)
39. Cristinacce, D., Cootes, T., Scott, I.: A multi-stage approach to facial feature detection. Proc. of 15 thBritish Machine Vision Conference, (2004) 40. Cristinacce, D., Cootes, T.: Feature detection and tracking with constrained local models. Proc. of the British Machine Vision Conf, (2006) 41. Li, Y., Gong, S., Liddell, H.: Modelling faces dynamically across views and over time. Proc. of Eighth International Conference On Computer Vision ICCV, (2001) 42. Romdhani, S., Gong, S., Psarrou, A.: A generic face appearance model of shape and texture under very large pose variations from profile to profile views. Proc. of International Conference on Pattern Recognition ICPR, (2000) 43. Turk, M., Pentland, A.: Face recognition using eigenfaces. Proc. of IEEE Conf. on Computer Vision and Pattern Recognition, pp. 586–591 (1991) 44. Moghaddam, B., Pentland, A.: Face recognition using view-based and modular eigenspaces. Proc. of Automatic Systems for the Identification and Inspection of Humans, (1994) 45. Belhumeur, P., Hespanha, J., Kriegman, D.: Eigenfaces vs. fisherfaces: Recognition using class specific linear projection. IEEE Trans. Pattern Analysis and Machine Intelligence, 19 (1997) 46. Sung, K., Poggio, T.: Example-based learning for view based human face detection. IEEE Transaction on Pattern Analysis Machines Intelligence,20 (1998) 47. Burel, G., Carel, D.: Detection and localization of faces on digital images. Pattern Recognition Letter, 15 (1994) 48. Rowley, H.A., Baluja, S., Kanade, T.: Neural network-based face detection. IEEE Transaction on Pattern Analysis Machines Intelligence, 20 (1998) 49. Lin, S.-H., Kung, S.-Y., Lin, L.-J.: Face recognition/detection by probabilistic decision-based neural network. IEEE Transaction on Neural Networks, 8 (1997) 50. Schneiderman, H., Kanade, T.: A statistical model for 3d object detection applied to faces and cars. Proc. of IEEE Conference on Computer Vision and Pattern Recognition, (2000) 51. Schneiderman, H., Kanade, T.: Object detection using the statistics of parts. International Journal of Computer Vision, (2004) 52. Viola, P., Jones, M.: Robust real-time face detection. International Journal of Computer Vision, (2004) 53. Lienhart, R., Maydt, J.: An extended set of haar-like features for rapid object detection. Proc. of International Conference on Image Processing ICIP02, (2002) 54. Mita, T., Kaneko, T., Hori, O.: Joint haar-like features for face detection. Proc. of Tenth IEEE International Conference on Computer Vision, ICCV 2005, (2005) 55. Li, S.Z., Zhang, Z.: Floatboost learning and statistical face detection. IEEE Trans. on Pattern Analysis and Machines Intelligence, 26, 1112–1123, September (2004) 56. Huang, C., Ai, H., Li, Y., Lao, S.: High-performance rotation invariant multiview face detection. IEEE Trans. on Pattern Analysis and Machines Intelligence, 2007. 57. Viola, P., Jones, M.: Fast and robust classification using asymmetric adaboost and a detector cascade. Advances in Neural Information Processing System, (2002) 58. Wu, J., Brubaker, S.C., Mullin, M.D., Rehg, J.M.: Fast asymmetric learning for cascade face detection. IEEE Trans. on Pattern Analysis and Machines Intelligence, 30, 369–382, March (2008) 59. Pham, M.-T., Cham, T.-J.: Online learning asymmetric boosted classifiers for object detection. Proc. IEEE Computer Vision and Pattern Recognition (CVPR07), (2007) 60. Brubaker, S.C., Wu, J., Sun, J., Mullin, M.D., Rehg, J.M.: On the design of cascades of boosted ensembles for face detection. 
International Journal of Computer Vision, (2008) 61. Mikolajczyk, K., Choudhury, R., Schmid, C.: Face detection in a video sequence - a temporal approach. Proc. of the IEEE Conference on Computer Vision and Pattern Recognition, Kauai, Hawaii, USA, Dec. (2001) 62. Schneiderman, H., Kanade, T.: Probabilistic modeling of local appearance and spatial relationships for object recognition. Proc. of IEEE Conference on Computer Vision and Pattern Recognition (CVPR98) (1998) 63. Freund, Y., Schapire, R.E.: A decision-theoretic generalization of on-line learning and an application to boosting. In Computational Learning Theory: Eurocolt 95, (1995)
64. Papageorgiou, C., Oren, M., Poggio, T.: A general framework for object detection. Proc. of International Conference on Computer Vision, (1998) 65. Anisetti, M., Bellandi, V., Damiani, E., Jeon, G., Jeong, J.: Full controllable face detection system architecture for robotic vision. Proc. of IEEE International Conference on Signal-Image Technology and InternetBased Systems (IEEE SITIS’07), (2007) 66. Anisetti, M. Bellandi, V., Damiani, E., Jeon, G., Jeong, J.: An adaptable architecture for human-robot visual interaction. Proc. of The 33rd Annual Conference of the IEEE Industrial Electronics Society (IECON’2007), (2007) 67. Martinez, A.M., Benavente, R.: The ar face database. Technical Report 24, CVC Technical Report, June (1998). 68. The bioid face database, (2001) 69. Pantic, M., Valstar, M.F., Rademaker, R., Maat, L.: Web-based database for facial expression analysis. Proc. of IEEE Int’l Conf. Multmedia and Expo (ICME’05), Amsterdam, The Netherlands, (2005) 70. Kanade, T., Cohn, J., Tian, Y.: Comprehensive database for facial expression analysis. Proc. of 4th IEEE International Conference on Automatic Face and Gesture Recognition (FG’00), pp. 46–53, Grenoble, France (2000) 71. Hammal, Z., Caplier, A., Rombaut, M.: A fusion process based on belief theory for classification of facial basic emotions. Proc. of Fusion’2005 the 8th International Conference on Information fusion (ISIF 2005), (2005) 72. Provost, F., Fawcett, T.: Robust classification for imprecise environments. Machine Learning, (2001) 73. Phillips, P.J., Moon, H., Rauss, P., Rizvi, S.A.: The feret evaluation methodology for facerecognition algorithms. Proc. of Computer Vision and Pattern Recognition, pp. 137-143 (1997). 74. Phillips, P.J., Rauss, P.J., Der, S.Z.: Feret (face recognition technology) recognition algorithm development and test results. Technical report, (1996) 75. Jesorsky, O., Kirchberg, K.J., Frischholz, R.W.: Robust face detection using the hausdorff distance. Proc. of Third International Conference on Audio- and Video-based Biometric Person Authentication, (2001) 76. Kirchberg, K.J., Jesorsky, O., Frischholz, R.W.: Genetic model optimization for hausdorff distance-based face localization. Proc. of International ECCV 2002 Workshop on Biometric Authentication, (2002) 77. Wu, J., Zhou, Z.-H.: Efficient face candidate selector for face detection. Pattern recognition, 36 (2003) 78. Fr¨oba, B., K¨ublbeck, C.: Robust face detection at video frame rate based on edge orientation features. Proc. of the Fifth IEEE International Conference on Automatic Face and Gesture Recognition, (2002) 79. Fr¨oba, B., Ernst, A.: Face detection with the modified census transform. Proc. of the Sixth IEEE International Conference on Automatic Face and Gesture Recognition, (2004) 80. Ramirez, G.A., Fuentes, O.: Multi-pose face detection with asymmetric haar features. Proc. of IEEE Workshop on Applications of Computer Vision (WACV 2008), (2008) 81. Roth, D., Yang, M.-h., Ahuja, N.: A snow-based face detector. Advances in Neural Information Processing Systems 12, 855–861 (2000) 82. Yang, M.-h., Abuja, N., Kriegman, D.N.: Face detection using mixtures of linear subspaces. Proc. of Fourth IEEE International Conference on Automatic Face and Gesture Recognition, (2000) 83. Meynet, J., Popovici, V., Thiran, J.: Face Detection with Doosted Gaussian Features. Pattern Recognition, 40 (2007) 84. Ebrahimpour, R., Kabir, E., Yousefi, M.R.: Face detection using mixture of mlp experts. Neural Processing Letters, 26 (2007)
Chapter 4
Automatic 3D Facial Fitting for Tracking in Video Sequence
Valerio Bellandi
University of Milan, Department of Information Technology, Via Bramante 65, 26013 Crema, Italy
Summary. This chapter presents the 3D face model initialization procedure. The major contribution is an innovative coarse-to-fine 3D initialization strategy that deals with common problems such as errors in the feature selection process and a lack of feature points. The proposed approach is based on a 3D deformable tracking system. We also propose a depth adaptation strategy, applied during tracking, to deal with the differences between the real face mesh and the model one. The proposed 3D refinements (including the depth-related one) are robust to the initial lighting conditions and to expression. Key words: 3D Face, Model Fitting, Face Tracking
4.1 The 3D Face Model
A well-known principle of computer vision states that the closer a model is to the real object, the more precise model-based tracking will be. For a face model this assumption takes on additional significance: i) precision in the sense of resolution (number of triangles per area), and ii) precision as subject-based adaptability. In fact, the shape of individual faces exhibits a great deal of variation, even though all faces share a good deal of common structure. A model with great precision in resolution but without adaptability, even if useful in some object tracking approaches (e.g., rigid objects), is not useful for facial modeling. In general it is preferable to have a simpler but adaptable model. Several advanced face models have been created by different research groups and companies, sometimes using thousands of polygons to describe a face, and sometimes with complex underlying physical and anatomical models. Despite this, the Candide model is still widely used, since its simplicity makes it a good tool for image analysis tasks and low-
complexity animation. The original Candide [1] contained 75 vertices and 100 triangles; this version is rarely used. The most widespread version, the "de facto" standard CANDIDE model, is a slightly modified model with 79 vertices, 108 surfaces and 11 Action Units. In our previous works [2] [3] we chose the Candide-3 model for our tests, obtaining good results for simple static tracking. Candide-3 is an extension of the CANDIDE model by Ahlberg, including 113 vertices and 168 surfaces, and it separates shape from animation: a Shape Unit (SU) defines a deformation of a standard face towards a specific face, so the Shape Parameters describe the static shape of a face, while the Animation Unit Vectors (AUV) and their parameters describe the dynamic shape. Shape Parameters are invariant over time, but specific to each individual; Animation Parameters naturally vary over time, but can be used for animating different faces. Here, we use a modified version of Candide-3 that includes a modified version of its AUVs, which we call Expression Unit vectors (EU), and some adaptations of the Shape Unit vectors (SU). Our extended, parameterized Candide-3 3D face model encodes the face shape, its expression, and its appearance and illumination in the scene. Our model's main features can be summarized as follows:
1. Triangle-based wireframe. This characteristic makes the model well suited for affine transformations, i.e., ones that map triangles into triangles. It is also useful for expression morphing, shape modifications and illumination variations.
2. Expression and Shape parametrization. Shape parameters describe unchanging features of an observed face and capture variations in shape across faces in the human population. Expression parameters refer to every type of non-rigid muscular movement related to expression changes, similar to the AUVs of Candide-3, which we call Expression Unit vectors (EU). This separation produces an easier tracking problem by requiring a smaller description of the object state to be estimated in each frame. The Shape and Expression of a face model g can be described by a simple linear formula:

g = ḡ + Sσ + Eα    (4.1)
where the vector ḡ contains the coordinates of the neutral shape model's vertices in the rest position, and S and E are respectively the Shape and Expression Unit vectors. The Shape vectors are controlled by the parameter vector σ, while the Expression vectors are controlled by α.
3. Parametrized texture. The texture associated with our model is defined by a linear combination of appearance and illumination parameters. This feature of our model is used principally during the initialization procedure, when the 3D model is adapted to the face detected in the scene.
Equation (4.1) is related to the triangle-based model and defines the face's Shape and Expression using the S and E vectors, which morph the initial vertices of every facet of model ḡ according to the σ and α parameters. For optimization purposes we also define a point-based model associated with the triangular one. This point-based model is called the Template, and the vectors that act as Shape or Expression but in the point domain are called bases. We use the same terminology for the appearance and illumination components, since they are also defined in the point domain. Therefore, our
model Template is a point-based 3D face model able to morph according to the Shape and Expression bases, and also able to change its appearance using the appearance and illumination bases. Our model thus features two different types of bases: i) morphing bases (shape or expression), and ii) appearance bases (texture or illumination). The first type directly concerns the 3D model conformation: both shape and expression rely on the same basis description (Section 4.2). The second concerns the appearance, or rather, the facial texture as it appears in the image plane, separating the illumination (defined using the morphing-basis approach) from the appearance (defined using a training set of face images).
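To make the structure of the parameterized model concrete, the following minimal sketch (in Python/NumPy, not the authors' implementation) evaluates Eq. (4.1) for given shape and expression parameters. The class name, array shapes and the toy Shape/Expression Units are illustrative assumptions.

```python
# A minimal sketch of the parameterized face model of Eq. (4.1):
# g = g_bar + S*sigma + E*alpha. All names and shapes are illustrative.
import numpy as np

class CandideLikeModel:
    def __init__(self, g_bar, triangles, shape_units, expression_units):
        self.g_bar = np.asarray(g_bar, dtype=float)        # (V, 3) neutral vertices
        self.triangles = np.asarray(triangles, dtype=int)  # (T, 3) vertex indices
        self.S = np.asarray(shape_units, dtype=float)      # (Ns, V, 3) Shape Unit vectors
        self.E = np.asarray(expression_units, dtype=float) # (Ne, V, 3) Expression Unit vectors

    def vertices(self, sigma, alpha):
        """Evaluate g = g_bar + S*sigma + E*alpha for given parameter vectors."""
        g = self.g_bar.copy()
        g += np.tensordot(sigma, self.S, axes=(0, 0))  # static, subject-specific deformation
        g += np.tensordot(alpha, self.E, axes=(0, 0))  # dynamic, expression deformation
        return g

# Usage: a toy model with one SU and one EU acting on a single triangle.
g_bar = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
tri = np.array([[0, 1, 2]])
S = np.zeros((1, 3, 3)); S[0, 2, 1] = 0.2   # SU: raise vertex 2 along y
E = np.zeros((1, 3, 3)); E[0, 1, 0] = -0.1  # EU: pull vertex 1 along -x
model = CandideLikeModel(g_bar, tri, S, E)
print(model.vertices(sigma=np.array([1.0]), alpha=np.array([0.5])))
```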
4.2 3D morphing basis
One of the main features of a parametrized 3D facial model is that it can evolve, e.g., according to expression changes. Our tracking process is based on the idea of separating straightforward rigid face motion from more complex expression or shape morphing. The morphing bases account for this second type of Template motion, and we define them in such a way that they can be efficiently included in our tracking algorithm. We consider morphing to be a well-localized process, producing deformation in a specific part of the face. This assumption is supported by the FACS [4] definition of AUs, and it is also reasonable for shape morphing. This way, morphing can be easily integrated with our warping function (i.e., the function that includes the 3D roto-translation), whose purpose is to replicate the motion (rigid and non-rigid) of the tracked face. In order to include the morphing motion into a rigid 3D face motion model in the point domain, we modify the Candide-3 morphing vectors by exploiting the triangle-based wireframe of the 3D model. Similarly to (4.1), our aim is to obtain an equation that linearly expresses the changes to our 3D Template T starting from a neutral (rest-position) Template T_0. Specifically, we use the following formula:

T = T_0 + ∑_{i=1}^{n} α_i B_i    (4.2)

where n is the number of morphing bases taken into account. Now, any complex deformation can be decomposed into a set of simpler morphing motions expressed in terms of the basis elements B_i. We know that the final position of the vertices of every triangle in the 3D model can be computed as follows:

V_f = V_i + ∑_{j=1}^{N} α_j A_j    (4.3)
where V_f is a matrix containing the coordinates of the vertices in their final positions, V_i the matrix of their initial positions, the α_j are parameters specifying which of the N possible movements are actually performed, and A_j is the matrix containing the components of the j-th morph. Note that A is a sparse matrix whose nonzero values in column j correspond to the triangle vertices affected by the j-th movement. Considering the three vertices of the k-th triangle, and writing [V_{i1k}; V_{i2k}; V_{i3k}] for the 3x3 matrix whose rows are those vertices, we can express the transformation that maps a point belonging to the original triangle to a point on the corresponding target one as:

[V_{f1k}; V_{f2k}; V_{f3k}] = [V_{i1k}; V_{i2k}; V_{i3k}] M_k    (4.4)

where V_{i1k} is the first vertex of the initial k-th triangle. The transformation for each triangle can then be written as:

M_k = [V_{i1k}; V_{i2k}; V_{i3k}]^{-1} [V_{f1k}; V_{f2k}; V_{f3k}]
    = I + [V_{i1k}; V_{i2k}; V_{i3k}]^{-1} ( ∑_{j=1}^{N} α_j A_{jk} )
    = I + ∑_{j=1}^{N} α_j B̃_{jk}    (4.5)
where A_{jk} is the matrix expressing the displacement corresponding to the j-th morph for the k-th triple of points, and B̃_{jk} is the transformation matrix for the j-th morph and k-th triple. Every point i of template T_k, or rather the part of the template that belongs to the k-th triangle, can be derived from template T_{0k} as follows:

T_{ik} = T_{0ik} M_k    (4.6)
      = T_{0ik} + ∑_{j=1}^{N} α_j T_{0ik} B̃_{jk}    (4.7)
      = T_{0ik} + ∑_{j=1}^{N} α_j B_{jik}    (4.8)
Considering every triangle k of each j-th morph, we obtain the bases of formula (4.2): B_{jk} is in fact the k-th vector of the j-th component of the basis B (note that, by the definition of our bases, every face morphing can be obtained by specifying a unique α vector). We are now able to define any expression or identity deformation as a weighted sum of basis components. This characteristic of our model will be exploited in the tracking phase, making our algorithm capable of estimating the α parameters directly. Figure 4.1 shows some examples of expression-related vectors. The approach described so far does not enforce any kind of intrinsic morph constraint. For instance, the mouth could be opened wider than the lower jaw bone would
Fig. 4.1 Examples of expression vectors: a comparison between the neutral face wireframe mesh and the same mesh after the application of a morphing vector (with unit magnitude), focusing on the triangles involved in the deformation. The final face conformation after morphing is shown in black, the initial neutral face state as a red dotted line. For the sake of clarity, we report some examples of single basis vector effects, from left to right: inner brow raiser, brow lowerer, and jaw drop.
allow, or the face shape could violate common anthropometric laws. Moreover, all expression or identity motions are expressed as linear combinations of basic motions. This is of course an approximation, even when complex and realistic motion bases are used. The approximation works well for small movements; for large movements the introduced error may not be negligible. Fortunately, such large movements are rare in facial expression or shape adaptation and, when they occur, the overall degradation of motion tracking is negligible.
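The following sketch (an illustrative Python/NumPy reading of Eqs. (4.4)-(4.8), not the chapter's code) shows how a point-domain basis vector B_{jik} can be derived from the triangle-level displacement A_{jk} and then applied through Eq. (4.2). The row-vector convention and the toy triangle data are assumptions; the rest-position vertex matrix is assumed nonsingular.

```python
# Sketch: derive a point-domain morphing basis from triangle-level morph
# displacements (Eqs. 4.4-4.8) and apply Eq. (4.2). Illustrative only.
import numpy as np

def point_basis_for_triangle(Vi_k, A_jk, T0_k):
    """Vi_k : (3, 3) rows = rest-position vertices of triangle k (assumed nonsingular)
       A_jk : (3, 3) rows = vertex displacements of morph j on triangle k
       T0_k : (P, 3) rows = template points lying on triangle k (rest position)
       returns B_jk : (P, 3) per-point displacement basis for morph j."""
    B_tilde = np.linalg.solve(Vi_k, A_jk)   # Vi_k^{-1} A_jk, shape (3, 3)
    return T0_k @ B_tilde                   # row-vector convention of Eq. (4.6)

def morph_template(T0, bases, alpha):
    """Apply Eq. (4.2): T = T0 + sum_j alpha_j * B_j (bases stacked as (n, P, 3))."""
    return T0 + np.tensordot(alpha, bases, axes=(0, 0))

# Usage with a toy triangle: morph j lifts the third vertex along z.
Vi_k = np.array([[1.0, 0.0, 0.5], [0.0, 1.0, 0.5], [0.0, 0.0, 1.0]])
A_jk = np.zeros((3, 3)); A_jk[2, 2] = 0.3
T0_k = np.array([[0.2, 0.2, 0.8], [0.1, 0.4, 0.7]])
B_jk = point_basis_for_triangle(Vi_k, A_jk, T0_k)
print(morph_template(T0_k, B_jk[None], alpha=np.array([1.0])))
```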
4.2.1 3D morphing basis for shape and expression
The morphability of our face model enables tracking non-rigid face motions (such as the ones used for talking heads, etc.). In this Section we describe how our model can morph according to each individual's shape and track time-variant face characteristics, such as expression changes. In particular, our aims are: i) to obtain the best-fitting 3D model using shape morphing, and ii) to obtain expression parameters that can be codified in the Facial Action Coding System (FACS) with Action Unit (AU) coding.
4.2.1.1 Shape Unit
As already mentioned, the Shape of the 3D face model is a characteristic related to each individual's facial conformation (non time-variant). Therefore the shape adaptation is
performed once, as initialization of the 3D model. Table 4.1 summarizes the Shape Unit vectors used for facial shape adaptation.

SU    Shape deformation description
SU1   Head height
SU2   Eyebrows vertical position
SU3   Eyes vertical position (Y values)
SU4   Eyes width
SU5   Eyes height
SU6   Eyes separation
SU7   Cheek bone z-extension
SU8   Nose z-extension
SU9   Nose vertical position
SU10  Nose tip height
SU11  Mouth height
SU12  Mouth width
SU13  Eyes vertically off-axis
SU14  Jaw dimension
SU15  Lips vertical dimension
SU16  Eyes z-extension
SU17  Nose hump z-extension
SU18  Chin z-extension

Table 4.1 SU list, including some classical Candide-3 SUs and the ones newly defined in our model (SU13-SU18). In blue, the SUs used for shape estimation with only one frontal-view image; in black, the SUs that affect only the depth dimension and are therefore not estimable from a single image.
Some SUs are the same as defined by the Candide-3 model, except for SU13-SU18. More in detail, the SUs shown in blue in Table 4.1 are used to efficiently estimate the 3D shape using only one image and some fiducial points (see the initialization procedure for further details). This strategy has some drawbacks, starting from the automation of fiducial point selection (many points are very difficult to detect with the required precision, especially the bounding points that define the facial contour). Another drawback is that the real 3D shape cannot be captured using only a single-shot image: in theory, a pair of images taken with a stereo camera, or a pair of images with correspondences between several fiducial points, is needed for 3D reconstruction, mainly to capture the depth dimension. The 3D adaptation using one image only concerns the correspondence between 3D points and points in the image plane; this adaptation is therefore an approximation that strongly depends on the SUs defined by the 3D model. The simplification works well for facial postures near the frontal view, but is not as accurate otherwise (accuracy here is evaluated against the global accuracy of the 3D model). For these reasons we slightly extend the set of Candide-3 SUs with further vectors (SU16, SU17, and SU18) focused on depth adaptation (see Table 4.1). Figure 4.2 shows an example of shape adaptation using the new SU vectors. The new depth-oriented SUs (some of them illustrated in Figure 4.2) are used to overcome the limitations of monocular shape adaptation.
Fig. 4.2 Depth-related SU examples: SU8-SU17 on the left and SU16-SU18 on the right. The final face conformation after morphing is shown in black, the initial neutral face state in red.
In general, the adaptation of the SUs shown in blue in Table 4.1 is performed on the first frame or over a few frames, while the depth-related SUs (shown in black in Table 4.1) are adapted only when the pose with respect to the camera allows inferring some information about the depth dimension (mainly when the pose differs from frontal). In conclusion, shape adaptation is performed in two stages: i) SU adaptation on the first frame, and ii) depth-oriented SU adaptation during tracking, only when the pose allows it (i.e., when it is far from frontal).
4.2.1.2 Expression Unit
Regarding expression morphing, or rather facial animation, we start by observing that facial skin is a complex system whose deformation is regulated by several muscles, a bone system, and the skin's own elasticity. Obviously, the linearized deformation introduced in Section 4.2 is not accurate enough to represent the behavior of such a complex system; nevertheless, the introduced error can be compensated while tracking. We follow the idea underlying the AUV definition of Candide-3, which is inspired by the Facial Action Coding System of Ekman [4]. The Facial Action Coding System (FACS) is the most widely used and versatile method for measuring and describing facial behaviors. Paul Ekman and W.V. Friesen developed FACS by determining how the contraction of each facial muscle (singly and in combination with other muscles) changes the appearance of the face. They examined videotapes of facial behavior to identify the specific changes that occur with muscular contractions and how best to differentiate one from another. Their goal was to create a reliable means for skilled human scorers to determine the category or categories into which to fit each facial behavior. FACS measurement units are not muscles but Action Units (AUs), for two reasons: i) for a few appearances, more than one muscle was combined into a single AU because the
changes in appearance they produced could not be distinguished, and ii) the appearance changes produced by one muscle were sometimes separated into two or more AUs to represent relatively independent actions of different parts of the muscle. A FACS coder "dissects" an observed expression, decomposing it into the specific AUs that produced the movement. An AUV is the corresponding implementation (of one or more Action Units) in the CANDIDE model; for example, in Candide-3 the Action Units 42 (slit), 43 (eyes closed), 44 (squint), and 45 (blink) are all implemented by AUV 6. In our case, for the sake of clarity, we use the same numbering as the FACS AUs. Of course some of our EUs (Expression Units) can be associated with more than one Ekman AU: for example, our EU20 (lip stretcher) codifies the Ekman AU20 but also the opposite Ekman AU18 (lip puckerer), since the latter is simply the opposite morphing (i.e., a change of sign in the parameter acting on the Expression Unit). As a further example, our EU25 (lips part) is related to AU25, AU26 (jaw drop), and AU27 (mouth stretch), depending on the intensity of the motion. So, several Ekman AUs are compacted into a set of Expression Units, each one with a different range of validity. Our aim is to define EUs as close as possible to the Ekman AUs. Considering the nature of our basis elements, each α parameter extracted during tracking only depends on the motion intensity. Table 4.2 shows the physical motion linked to each Expression Unit and represented by its EU parameter (some eye-related AUs, namely 5, 7 and 41 to 46, are omitted since they are very difficult to evaluate profitably in a tracking process). Even though the EUs are defined to be as close as possible to the AU definition, the EU parameters alone are not enough for a strict Ekman AU classification (the same holds for the AUVs of the Candide model); some relations need to be formulated, which is in general done independently at evaluation time.

EU (α parameter)  Muscular movement   Related AUs
EU1               Inner brow raiser   AU1, AU4
EU2               Outer brow raiser   AU2, AU4
EU4               Brow lowerer        AU4
EU9               Nose wrinkler       AU9
EU10              Upper lip raiser    AU10
EU12              Lip corner pull     AU12, AU15
EU20              Lip stretcher       AU20, AU18
EU23              Lip tightener       AU23
EU25              Lips part           AU25, AU26, AU27

Table 4.2 Link between EUs, muscular movements and AUs. A positive α applied to an EU follows the direction of the described movement; a negative one moves in the opposite direction.
In conclusion, our EU basis permits extracting some AU-related parameters. Of course, many other AUs related to head motion are extracted as well; the omitted eye-related AUs can also be extracted using the local eye-state analysis algorithm described in [2].
Fig. 4.3 3D face model in the neutral position (first image), with the eyebrows raised, with the mouth opened, and with a composition of various expressions.
4.3 Appearance basis
Another important characteristic of the 3D facial model is its texture. As already mentioned, the 3D conformation of the face is characterized by great inter-class variability (shape differences between subjects) and intra-class variability (muscular movements). In a similar way, the face texture presents great variability. Our aim is to parametrize the variability of the face texture, obtaining a linear reconstruction. We follow the AAM/MM approach, computing the texture appearance parameters by applying Principal Component Analysis (PCA) based techniques to a properly normalized face database. Principal component analysis (PCA) is a vector space transform, mathematically defined as an orthogonal linear transformation that maps the data to a new coordinate system such that the greatest variance by any projection of the data comes to lie on the first coordinate (called the first principal component), the second greatest variance on the second coordinate, and so on. In theory, PCA is the optimum transform of the given data in the least squares sense.
4.3.0.3 PCA
The definition of PCA is generally presented using the singular value decomposition (SVD) approach. Consider a data matrix X with M rows and N columns, where M is the number of observed variables and N the number of repetitions of the experiment. Each column of the data matrix X is a data column vector x_1 ... x_N, with each x_n representing a single grouped observation of the M variables. The PCA transformation of the zero-mean data matrix X^T (the empirical mean of the distribution has been subtracted from the data set) is given by Y^T = X^T W = VΣ, where VΣW^T is the singular value decomposition of X^T. Another largely used approach is called the covariance method. The goal is to transform the data set X of dimension M into an alternative data set Y of smaller dimension L; this is the same as seeking the matrix Y that is the Karhunen-Loeve transform (KLT) of matrix X. Following the covariance method, the empirical
mean is computed along each dimension m = 1 ... M: u[m] = (1/N) ∑_{n=1}^{N} X[m, n], where u is the empirical mean vector of dimension M x 1 (this step is the same as described above for the SVD approach). Then, before calculating the covariance matrix, the deviations from the mean are computed by centering the data: B = X - uh, where h is a 1 x N row vector of all ones (h[n] = 1 for n = 1 ... N) and B is the mean-subtracted M x N matrix. The empirical M x M covariance matrix C is calculated from the outer product of matrix B with itself: C = E[B ⊗ B] = E[B · B*] = (1/N) B · B*, where E is the expected value, ⊗ is the outer product operator, and * is the conjugate transpose operator (if B consists entirely of real numbers, the conjugate transpose is the same as the regular transpose). It is now possible to compute the matrix V of eigenvectors which diagonalizes the covariance matrix C: V^{-1} C V = D, where D is the diagonal matrix of the eigenvalues of C. D takes the form of an M x M diagonal matrix, with D[p, q] = λ_m for p = q = m (the m-th eigenvalue of the covariance matrix C) and D[p, q] = 0 for p ≠ q. The matrix V, also of dimension M x M, contains M column vectors, each of length M, which are the M eigenvectors of the covariance matrix C. The eigenvalues and eigenvectors are ordered and paired: the m-th eigenvalue corresponds to the m-th eigenvector. The eigenvalues represent the distribution of the source data's energy among the eigenvectors, which form a basis for the data. To obtain the principal components, sort the columns of the eigenvector matrix V and eigenvalue matrix D in order of decreasing eigenvalue and take the first eigenvectors.
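As a compact illustration of the covariance method just described, the sketch below (a generic NumPy implementation, not tied to the chapter's software) computes the mean, the centered data, the covariance matrix and the first L principal components; variable names mirror the text.

```python
# Covariance-method PCA sketch: X is (M variables x N observations).
import numpy as np

def pca_covariance(X, L):
    """Return the first L principal directions W, the projected data Y,
    and the sorted eigenvalues."""
    u = X.mean(axis=1, keepdims=True)       # empirical mean along each dimension
    B = X - u                               # mean-subtracted data
    C = (B @ B.conj().T) / X.shape[1]       # empirical covariance, (M, M)
    eigvals, V = np.linalg.eigh(C)          # eigendecomposition (C is Hermitian)
    order = np.argsort(eigvals)[::-1]       # sort by decreasing eigenvalue
    W = V[:, order[:L]]                     # first L principal directions
    Y = W.T @ B                             # projected data, (L, N)
    return W, Y, eigvals[order]

# Usage on random data: keep the two strongest components.
X = np.random.default_rng(0).normal(size=(5, 100))
W, Y, lam = pca_covariance(X, L=2)
print(W.shape, Y.shape, lam[:2])
```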
4.3.0.4 Image-based PCA
Mathematically, the PCA approach applied to images treats every image of the training set as a vector in a very high-dimensional space. The eigenvectors of the covariance matrix of these vectors incorporate the variation among the face images, and each image in the training set contributes to the eigenvectors. The high-dimensional space spanned by the eigenfaces is called the image space (feature space). We use grayscale textures, so a face image I(x, y) is a two-dimensional N x N array of intensity values. An image may also be considered as a vector of dimension N^2 (e.g., a 512 x 512 image becomes a vector of dimension 262144). An ensemble of images then maps to a collection of points in this huge space, and principal component analysis finds the vectors that best account for the distribution of the face images within this entire space. In the rest of this chapter we use the covariance method for the PCA definition. Let the training set of M face images be T_1, T_2, T_3, ..., T_M. This training set has to be mean-centered before calculating the covariance matrix or the eigenvectors. The average face is calculated as T_avg = (1/M) ∑_{i=1}^{M} T_i; a typical appearance of the average face is shown in Figure 4.5. The vector φ_i = T_i - T_avg describes the difference between each image i in the data set and the average face. The covariance matrix is:
Fig. 4.4 Examples of eigenface appearances.
Fig. 4.5 Average face in the canonical eigenface approach.
C = (1/M) ∑_{i=1}^{M} φ_i φ_i^T    (4.9)
  = A A^T    (4.10)

where A = [φ_1, φ_2, ..., φ_M]. The eigenvectors u_k and eigenvalues δ_k are chosen such that
δ_k = (1/M) ∑_{i=1}^{M} (u_k^T φ_i)^2    (4.11)

is maximum, where

u_l^T u_k = 1 if l = k, 0 otherwise    (4.12)
These formulas attempt to capture the sources of variance. The vectors u_k are referred to as "eigenfaces", since they are eigenvectors and appear face-like. Figure 4.4 shows the appearance of these eigenfaces, which look like ghostly images; in each eigenface some sort of facial variation can be seen, deviating from the original image. The matrix C is an N^2 x N^2 matrix and would generate N^2 eigenvectors and eigenvalues. With image sizes of 512 x 512, or even lower, such a calculation would be impractical. A computationally feasible method to find the eigenvectors was suggested in [5]: if the number of images in the training set is less than the number of pixels in an image (i.e., M < N^2), we can solve an M x M eigenproblem instead of an N^2 x N^2 one. Consider the matrix A^T A instead of A A^T; its eigenvectors v_i can be calculated as follows:
A^T A v_i = μ_i v_i    (4.13)
where μ_i is the corresponding eigenvalue. Here the size of the matrix is M x M, so we obtain M eigenvectors instead of N^2. Premultiplying equation (4.13) by A, we have

A A^T A v_i = μ_i A v_i    (4.14)

from which it follows that the A v_i are eigenvectors of C = A A^T. Following [5], an M x M matrix L = A^T A is created, where L_{mn} = φ_m^T φ_n, and the M eigenvectors v_l of L are computed (a smaller matrix L yields a smaller number of eigenvectors). The eigenfaces are then:
u_l = ∑_{k=1}^{M} v_{lk} φ_k    (4.15)
With this analysis the calculations are greatly reduced, from the order of the number of pixels in the images (N^2) to the order of the number of images in the training set. In practice, with a relatively small training set (M << N^2), the calculations become manageable. Since for our purposes an exact reconstruction of the face is not required, we can further reduce the dimensionality to M' instead of M. This is done by selecting the M' eigenfaces with the largest associated eigenvalues. In this manner the number of parameters required for face reconstruction is bounded to a small subset, while the quality of the reconstruction remains suitable for our purposes. These eigenfaces now span an M'-dimensional subspace instead of N^2. Finally, we obtain a technique to linearize the texture description starting from a limited number of image bases u_i (eigenfaces) and parameters δ_i (eigenvalues).
Considering T_app as the appearance template of our face, we can describe this appearance as a linear combination of eigenfaces:

T_app = T_avg + ∑_{l=1}^{G} u_l δ_l    (4.16)
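The following sketch illustrates the eigenface construction with the small-matrix trick of [5] (Eqs. (4.13)-(4.15)) and the reconstruction of Eq. (4.16). It assumes pre-normalized, flattened grayscale face vectors of equal length; the synthetic data in the usage example is purely illustrative.

```python
# Eigenfaces via the M x M trick (L = A^T A) and reconstruction (Eq. 4.16).
import numpy as np

def build_eigenfaces(images, n_keep):
    """images: (M, N2) rows are flattened training faces; n_keep: M' eigenfaces."""
    T_avg = images.mean(axis=0)
    Phi = images - T_avg                    # rows are the phi_i, i.e. A^T
    L = Phi @ Phi.T                         # small M x M matrix L = A^T A
    w, v = np.linalg.eigh(L)                # eigenvectors v_l of L
    order = np.argsort(w)[::-1][:n_keep]
    U = Phi.T @ v[:, order]                 # eigenfaces u_l = sum_k v_lk * phi_k
    U /= np.linalg.norm(U, axis=0)          # normalize each eigenface
    return T_avg, U

def reconstruct(face, T_avg, U):
    """Eq. (4.16): T_app ~ T_avg + sum_l u_l * delta_l, with delta = U^T (face - T_avg)."""
    delta = U.T @ (face - T_avg)
    return T_avg + U @ delta

# Usage with synthetic 16x16 "faces".
rng = np.random.default_rng(1)
faces = rng.normal(size=(20, 256))
T_avg, U = build_eigenfaces(faces, n_keep=5)
print(np.linalg.norm(reconstruct(faces[0], T_avg, U) - faces[0]))
```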
To create this parametric texture model following the eigenface approach, we first need to construct the face space. This space suffers from some problems that sensibly degrade the quality of the results. Two of the main issues are:
1. Pixel correlation between the face images in the training set. This is at the basis of face space creation and can be assimilated to the concept of alignment. Figure 4.5 shows a typical problem caused by the absence of correlation in the face image space (most visible in the eye regions). The concept of alignment becomes more complicated when the 3D nature of the face is considered: in many cases the pose of the subjects used for image space creation is not the same (e.g., strictly frontal). Furthermore, each individual's different facial shape makes perfect alignment a difficult issue. In some cases a further tangle is the expression of the faces used for image space creation, which is not uniform across faces (e.g., not every face is in a neutral expression).
2. Illumination normalization and absence of occlusions. Since the eigenface approach relies on pixel intensities, every outlier that modifies these intensities, whether due to illumination or to occlusion by non-face objects, produces a negative effect on face space creation.
Some works in the literature try to solve these issues by performing a normalization. For instance, the AAM [6] approach uses a piecewise affine warp over a normalized shape; in the case of Morphable Models [7], normalization is obtained by constructing 3D model correspondences. In our previous work [8] we proposed to address these shortcomings of the eigenface approach using a tracking-based technique. The main idea is that, using the 3D model defined in Section 4.1 and the illumination model of the following Section 4.4, we can overcome the weaknesses of the eigenface approach: pixel correlation is guaranteed thanks to the 3D model (assuming a perfect model fit), and illumination is compensated using the 3D model together with the illumination model of Section 4.4. Using the 3D model we are able to normalize across pose, shape and expression; furthermore, we infer an approximation of the illumination conditions and normalize the final texture. For the sake of conciseness we refer to our work [8] for a deeper description of the normalization process. This approach to appearance normalization seems more reliable than the AAM one, since it takes into account a real pose normalization instead of a 2D triangle normalization. Regarding occlusions, many low-level analysis techniques can be used to define the occluded region. In our previous work [8] we investigated the problem of occlusions when using eigenfaces for face identification. Since we need to create a face-based texture, at this stage we can avoid occlusions simply by discarding the face images that include them. In recent work [9], occlusions are instead included in the training set and handled with a PCA-with-missing-data approach; this is mainly due to the fact that the AAM approach uses the texture appearance during the
tracking process and for correction purposes. In our case the appearance is used only for initialization, leaving the tracking based on the real texture acquired after initialization. Therefore we defer occlusion management to the algorithmic part of this work, leaving the model free of any further training dependency.
(a) Cropped faces
(b) Normalized with 3D Model
Fig. 4.6 Cropped vs. normalized face images: two examples, the top from the MMI database [10], the bottom from the Hammal-Caplier database [11].
Our image space is created using a set of occlusion-free face images cropped and normalized using our 3D face model. Back-projection from the images onto the 3D model and then back to the image plane produces the face images for the eigenspace. To better deal with the lack of information about the lateral part of the face (with respect to the frontal pose), we use a cylindrical projection to back-project from the image onto the 3D model; in this manner we avoid the typical lack of appearance information in the texture of the lateral part of the face. Figure 4.6 shows some examples of simple face normalization using the 3D model. The 3D-model approach to face image normalization produces an efficient eigenspace, as discussed in [8], which permits describing the face texture using a small number of parameters (eigenvectors); the space reduction is therefore more efficient than in the case of the standard eigenspace approach. Figure 4.7 shows some principal eigenfaces and the average face. It is easy to note that the non-correlation artifacts are absent in this eigenspace (most visible in the average face). In this manner the
eigenface space effectively captures the changes in appearance while minimizing the presence of noise.
(a) Eigenfaces
(b) Average face
Fig. 4.7 Normalized average face and principal eigenfaces.
As already mentioned, the eigenface approach can be used to reduce the number of parameters needed to reconstruct a face inside the face space. Figure 4.8 shows the eigenvalues associated with each eigenvector, ordered from the largest. It can easily be seen that in the normalized case (blue line) the energy is more concentrated in a few eigenvectors, so the same amount of information can be captured with fewer eigenvectors. The difference is not very large but is nevertheless significant (5 eigenvectors in the normalized case are approximately equivalent to 11 in the non-normalized one). The main reason why the difference is not as large as expected is that the faces used for training are already quite normalized in pose and size. Based on our experiments, we decided to use a small subset of eigenvectors to represent the face appearance, since only a few eigenvectors contain the greatest part of the information for face appearance reconstruction. In conclusion, our 3D model includes a set of basis components for appearance: the average face template T_avg and the 5 principal component templates T_app,1..5.
4.4 3D Illumination basis
Variability in lighting has a large effect on the appearance of objects in images, especially objects with complex shapes such as faces. Figure 4.9 shows an example of this change in appearance. Tolerating abrupt light changes is a big challenge for most face recognition techniques.
Fig. 4.8 PCA eigenvalue histogram of normalized (blue line) vs. non-normalized (red line) faces.
Many effects need to be taken into account, starting from the type of light and its distance and position in space. Other important aspects are related to skin reflectance [12] (which can be altered, for instance, by the presence of makeup). In this work we simplify the treatment of light (already confined by construction, since the entire tracking process works in the grayscale domain), considering the other effects as noise to be managed. Following [13], assuming a Lambertian surface in the absence of self-shadowing, all the images of the same surface under different lighting conditions lie in a three-dimensional linear subspace of the space of all possible images of the object. (A Lambertian surface [16] is a surface with perfectly matte properties: it strictly adheres to Lambert's cosine law, which states that the reflected or transmitted luminous intensity in any direction varies as the cosine of the angle between that direction and the surface normal; as a consequence, the luminance of a Lambertian surface is the same regardless of the viewing angle [17].) As already mentioned, in our case the surface is unfortunately not truly Lambertian, while the self-shadowing problem is managed thanks to the 3D face model. Some interesting works [14] [15] successfully simplify the face illumination subspace using a linear combination of illumination image bases defined by a training set of face textures under different illumination conditions. Our approach relies on a specific set of bases built on the 3D model in order to compensate for the effects of variations in illumination. When a light source is "distant" enough from a face, all the points on the face share the same orientation with respect to the direction of the light source; in other words, the same light intensity is reflected back to the viewer from all points of the face. Lighting may come from multiple sources, including diffuse sources such as natural ambient light (the sky outdoors). We can therefore describe the intensity of the light as a single function of its direction that does not depend on the position in the scene. We can thus compute an approximation of the overall light intensity on the face, depending on the intensity and direction of the light source as well as on the intensity of the ambient light, and store it in a matrix. For Lambertian surfaces, even multiple (distant)
Fig. 4.9 The three principal illumination bases.
light sources can be handled easily: it is just a matter of adding up their individual contributions to find the appearance of a certain surface under the defined lighting [18]. In the Lambertian case, the three given bases correspond to three linearly independent light source directions, and one can always derive the image of the surface under a new light direction as a linear combination of these three bases (Figure 4.9). Namely, the irradiance at a point x is given by

I = α n · L    (4.17)

where n is the normal vector to the surface, α is the albedo coefficient, and L encodes the power and direction of the incident light rays. Our computation of intensity involves three steps.
• Firstly, we compute the normal direction of each facet.
• Secondly, we compute the normal vector at each vertex j as the weighted sum of the normals of all triangles sharing vertex j:

V_j = ( ∑_i φ_ij n_i ) / ( ∑_i φ_ij )    (4.18)

where φ_ij is the angle at vertex j in triangle i, and n_i is the normal of triangle i. In this manner we obtain a smooth normal vector basis (similar to Gouraud shading); if we instead used the normal of each triangle rather than a weighted normal at each vertex, we would obtain an illumination model with discontinuities across triangle edges. Following the motion basis construction, starting from the vertex values we create a matrix associating every point of the template with its normal vector.
• Thirdly, we build the illumination bases using the Phong lighting technique, similarly to what we did for the expression and shape bases. Calling B_x, B_y and B_z the vectors of each direction component, the new template appearance T can be written starting from the previous template T_app as follows:
T = T_app + ∑_{l=1}^{3} λ_l ( B_x cos(θ_l + rot_x) + B_y cos(φ_l + rot_y) + B_z cos(ψ_l + rot_z) )    (4.19)

where θ_l, φ_l, ψ_l are the directions of the l-th light and rot_x, rot_y, rot_z are the estimates of the rotation of the template. The λ_l parameters are the intensities of each light, which will be estimated by the tracking algorithm described in the following Section. Of course this process is only an approximation, aimed at minimizing the influence of light on the scene during the tracking process. This set of normal vector bases also turns out to be useful for managing hidden triangles and self-shadowing in the tracking algorithm: hidden-triangle occlusion is usually classified as self-occlusion, and thanks to the normals we know which facets are no longer visible and can stop analyzing them.
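A minimal sketch of the angle-weighted vertex normals of Eq. (4.18) is given below (illustrative Python/NumPy, not the authors' code); these per-vertex normals supply the B_x, B_y, B_z components used to assemble the illumination basis of Eq. (4.19).

```python
# Angle-weighted vertex normals (Eq. 4.18) for a triangle mesh. Sketch only.
import numpy as np

def vertex_normals(vertices, triangles):
    """vertices: (V, 3); triangles: (T, 3) vertex indices. Returns (V, 3) smooth normals."""
    normals = np.zeros_like(vertices)
    weights = np.zeros(len(vertices))
    for tri in triangles:
        p = vertices[tri]
        n = np.cross(p[1] - p[0], p[2] - p[0])
        n /= np.linalg.norm(n) + 1e-12                  # triangle normal n_i
        for j in range(3):                              # angle phi_ij at each vertex
            a = p[(j + 1) % 3] - p[j]
            b = p[(j + 2) % 3] - p[j]
            cosang = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)
            phi = np.arccos(np.clip(cosang, -1.0, 1.0))
            normals[tri[j]] += phi * n
            weights[tri[j]] += phi
    return normals / np.maximum(weights, 1e-12)[:, None]  # Eq. (4.18)

# Usage: a tiny two-triangle patch.
V = np.array([[0., 0., 0.], [1., 0., 0.], [0., 1., 0.], [1., 1., 0.2]])
T = np.array([[0, 1, 2], [1, 3, 2]])
print(vertex_normals(V, T))
```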
4.5 The General Purpose 3D Tracking Algorithm
We shall now see how the morphing and illumination bases described in the previous Sections can be used to compute posture, morphing deformation and illumination parameters in a single minimization process. For the sake of clarity, we first describe the steepest descent algorithm for 3D posture and morphing estimation. We rely on the idea that a 2D face template T(x) (extracted by projecting the 3D Template T onto the image plane) always appears in the next frame I(x), albeit warped by W(x; p), where p = (p_1, ..., p_n, α_1, ..., α_m) is the parameter vector of the 3D face model with m Candide-3-like animation unit (AUV) movement parameters, and x are pixel coordinates in the image plane. Therefore, we can obtain the movement and expression parameters p by minimizing function (4.20). In fact, if T(x) is the template at time t with the correct pose and expression p and I(x) is the frame at time t + 1, assuming illumination does not change much, the next correct pose and expression p at time t + 1 can be obtained by minimizing the sum of squared errors between T(x) and I(W(x; p)):

∑_x [ I(W(x; p)) - T(x) ]^2    (4.20)
For this minimization we use a forward additive implementation approach like the one presented in [19]. Namely, our technique assumes an estimate of p to be known and iteratively solves for increments Δp to the parameters. After some simple manipulations, Equation (4.20) leads to:
Δp = H^{-1} ∑_x [ ∇I (∂W/∂p) ]^T [ T(x) - I(W(x; p)) ],    H = ∑_x [ ∇I (∂W/∂p) ]^T [ ∇I (∂W/∂p) ]    (4.21)
where ∇I is the gradient of image I evaluated at W(x; p), ∂W/∂p is the Jacobian of the warp, and Δp is the incremental warp parameter vector. In order to recover the 3D posture and expression morphing parameters, we consider that the motion of a head point X = [x, y, z, 1]^T between time t and t + 1 is X(t+1) = M · X(t), and the expression morphing of the same point is:

X(t+1) = X(t) + ∑_{i=1}^{m} α_i B_i    (4.22)
where α_i and B_i are represented as explained in the previous Section, and M is a roto-translation matrix whose rotation is represented as a Fick-style matrix (in our previous work [20] we used a Bregler-Murray matrix, but later found that in critical applications the deformation error introduced by that approximation may not be negligible). This way, the motion parameters p become (ω_x, ω_y, ω_z, t_x, t_y, t_z, α_1, ..., α_m). The novelty of this approach is that our warping W(x; p) includes both the roto-translation parameters and the morphing bases. The warping in (4.20) becomes:
W(x; p) = M ( X + ∑_{i=1}^{m} α_i B_i )    (4.23)
Under perspective projection, assuming the camera projection matrix depends only on the focal length f_L, the image plane coordinate vector x is computed as:

x = f_L · ( [ x(s_y c_z) + y(s_x s_y c_z - c_x s_z) + z(c_x s_y c_z - s_x s_z) + t_x + B_x ] / [ x(-s_y) + y(s_x c_y) + z(c_x c_y) + t_z + B_z ] ,
            [ x(c_y s_z) + y(s_x s_y s_z - c_x c_z) + z(c_x s_y s_z - s_x c_z) + t_y + B_y ] / [ x(-s_y) + y(s_x c_y) + z(c_x c_y) + t_z + B_z ] )^T    (4.24)

where s_x and c_x denote sin ω_x and cos ω_x respectively (and similarly for the other axes), and

B_x = ∑_{i=1}^{m} α_i ( a_i(s_y c_z) + b_i(s_x s_y c_z - c_x s_z) + c_i(c_x s_y c_z - s_x s_z) )
B_y = ∑_{i=1}^{m} α_i ( a_i(c_y s_z) + b_i(s_x s_y s_z - c_x c_z) - c_i(c_x s_y s_z - s_x c_z) )
B_z = ∑_{i=1}^{m} α_i ( a_i(-s_y) + b_i(s_x c_y) + c_i(c_x c_y) )    (4.25)
where a_i, b_i, c_i represent respectively the x, y, z components of the vector elements of basis B_i. Summarizing our technique: we add the basis components to each 3D model point and multiply them by α_i, representing the estimated intensity of that basis. This function jointly maps 3D motion and morphing onto the image plane. Then, using a forward additive estimation approach, we obtain the correct 3D
motion, posture and morphing parameters of the template between two frames in a single minimization step. Although faster implementations of this minimization strategy can in principle be devised (e.g., using the inverse approach described in [19]), our implementation reaches an acceptable frame rate even in a preliminary Matlab implementation. Unfortunately, the complexity of our warping technique makes it hard to reach a satisfactory precision with this minimization strategy alone. In order to improve robustness with respect to global and local illumination changes, we introduced five additional parameters into our minimization algorithm, using a LAV-like approach [19]. Namely, we consider the image template T(x) as:

T(x) + ∑_{i=1}^{5} λ_i L_i(x)    (4.26)
where the L_i, i = 1, ..., 5, are a set of known appearance variation images and the λ_i, i = 1, ..., 5, are the appearance parameters. Global illumination changes can then be modeled as arbitrary changes in gain and bias between the template and the input image by setting L_1 to be the template T and L_2 to be the unitary "all-ones" image. This approach has been widely used, however without achieving a sufficiently high level of robustness to illumination changes. To improve it, we again exploit our 3D model to account for lateral illumination by using the other illumination elements of our basis, L_i, i = 3, ..., 5. Differently from the morphing case, here we do not use each vector to produce a movement but only to obtain a value that represents the illumination level. Using equation (4.26) instead of T(x) in (4.20), we obtain the following expression to be minimized:

min ∑_x [ I(W(x; p)) - T(x) - ∑_{i=1}^{5} λ_i L_i(x) ]^2    (4.27)
Minimization is then achieved using the steepest descent approach. Figure 4.10 shows some examples of tracking experiments with illumination changes in a realistic environment, using a standard low-quality webcam. We illustrate the improvement in the accuracy of posture and expression estimation in Section 4.7.
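The structure of one forward-additive Gauss-Newton step (Eqs. (4.21) and (4.27)) can be sketched as follows. The sampled image gradients, warp Jacobian and residual are assumed to be provided by model-specific code (the warp W(x; p) of Eq. (4.23) and the illumination-corrected template); the random data in the usage example is a placeholder only.

```python
# One forward-additive update step: delta_p = H^{-1} * SD^T * residual. Sketch only.
import numpy as np

def tracking_step(I_grad_warp, J_warp, residual):
    """I_grad_warp: (P, 2) image gradient of I sampled at W(x; p)
       J_warp     : (P, 2, n) Jacobian dW/dp at each of the P template points
       residual   : (P,) values of T(x) + sum_i lambda_i L_i(x) - I(W(x; p))
       returns delta_p (n,), the additive parameter update."""
    SD = np.einsum('pk,pkn->pn', I_grad_warp, J_warp)  # steepest-descent images
    H = SD.T @ SD                                      # Gauss-Newton Hessian, Eq. (4.21)
    b = SD.T @ residual
    return np.linalg.solve(H, b)                       # delta_p = H^{-1} b

# Usage with random placeholder data (3 parameters, 50 template points).
rng = np.random.default_rng(2)
dp = tracking_step(rng.normal(size=(50, 2)),
                   rng.normal(size=(50, 2, 3)),
                   rng.normal(size=50))
print(dp)
```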
4.5.1 Feature Location
The feature location process aims to identify point-like features with a direct correspondence to our 3D model, in particular to the vertices of the triangular 3D model. Since a face region has already been defined by one of the aforementioned face detection strategies, the feature detection process takes advantage of several anthropometric measures that constrain the feature search. Furthermore, in the case of feature-oriented face detection, the eye search region is already well bounded. Our feature location approach localizes eyebrow, eye and mouth related features (in general, the corners). The approach is similar for every feature: i) detect the
Fig. 4.10 Examples of tracking under extreme illumination variation and with variations in pose and expression. The tracking mask is shown in black.
feature with an object-detection approach (with ad hoc training for each feature), constraining the search area using anthropometric measures or previous information, and ii) perform low-level analysis for point detection. For the sake of conciseness we report the approach used for the eye features; the other features use a similar approach. Considering the detected face f_a, the eye search process acts on the upper half of the facial area. We use two specular approaches for the left and right eye: the upper facial part is further divided into two regions, one for the left-eye search and the other for the right one. The detection process used to locate an eye inside each of these regions is the same, but trained with specular samples (i.e., using the same training set for left- and right-eye search, simply exploiting the specularity of the eye pair via a horizontal flip). Localizing an eye in a face region produces the results shown in Figure 4.11: the scanning process finds several possible eyes insisting on overlapping regions. Since the eye region is already normalized, no tilt parameter is required while scanning.
Fig. 4.11 Eye search process: an example of application on the BioID database. In green, the set of probable eye locations Eye; in red, the final eye location Eye'.
Considering the set of n rectangular regions Eye = [x_{r,i}, y_{r,i}, x_{r,f}, y_{r,f}], where r = 1 ... n and x_{r,i}, y_{r,i} refer to the top-left corner of the r-th rectangle while x_{r,f}, y_{r,f} refer to its bottom-right corner, the final eye location box (the red one in Figure 4.11) is Eye' = [ ∑_{j=1}^{n} x_{j,i}/n, ∑_{j=1}^{n} y_{j,i}/n, ∑_{j=1}^{n} x_{j,f}/n, ∑_{j=1}^{n} y_{j,f}/n ]. Inside the detected eye region, further low-level analysis is performed to obtain the set of feature points. The low-level analysis starts on the normalized eye region Eye', using some initial points chosen by experience and then adapted automatically. These initial points (shown in red in Figure 4.12) are adapted through a multi-step process based on neighborhood gradient analysis, up to the final feature location (shown in blue in Figure 4.12).
Fig. 4.12 Eye’s features low level searching. In red the initial points, in blu the final feature location after local searching. In green the intermedium result.
This low-level process is experimentally driven. The precision of the selected feature points is adequate as a starting point for further refinement.
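For completeness, the eye-box fusion rule described above (the coordinate-wise mean of the overlapping candidate rectangles) amounts to a few lines; the corner-based box format below is an assumption.

```python
# Fuse overlapping candidate eye boxes into Eye' by coordinate-wise averaging.
import numpy as np

def fuse_eye_candidates(boxes):
    """boxes: (n, 4) rows [x_i, y_i, x_f, y_f] (top-left and bottom-right corners)."""
    boxes = np.asarray(boxes, dtype=float)
    return boxes.mean(axis=0)   # Eye' = mean of each corner coordinate

# Usage: three overlapping detections around the same eye.
candidates = [[30, 40, 60, 58], [32, 41, 63, 60], [29, 39, 61, 59]]
print(fuse_eye_candidates(candidates))
```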
4.6 Model adaptation
The accuracy of model initialization has an impact on the rest of the tracking process. Our model initialization refinement takes advantage of the previously detected facial feature points to obtain an initial rough model initialization (shape, expression and posture). In our previous work [21], the fitting process was initially performed on the first frame by manually picking 28 fiducial points corresponding to relevant anatomical
features (eyebrows, eyes, nose, mouth). Figure 4.13 shows the creation of an individual template through 3D model initialization by hand-picking the fiducial points.
(a) Frontal view
(b) Feature Point
(c) Face mask
(d) Template 3D
Fig. 4.13 3D template extraction process on a sample taken from the MEED [22] ecological database, with uncontrolled illumination and subject movements.
Using the automatic facial feature detection strategy described above, we no longer need any manual intervention in the initialization procedure. On the other hand, we have to deal with a smaller number of feature points and with the larger uncertainty of these features compared to the manually picked ones. As already mentioned, we use the features for a rough estimation of posture and morphing, and delegate further refinement to a tracking approach. The overall 3D model adaptation is therefore composed of: i) an initial rough model initialization (a two-step process: feature-based posture evaluation, followed by shape and expression inference), and ii) a 3D model refinement. In the following we describe each model adaptation step in detail.
4.6.1 Feature-based pose estimation
Our pose estimation algorithm relies on a feature-based posture evaluation called Modern POSIT [23]. This algorithm iteratively finds the posture of a 3D object simply from the correspondences between image feature points and the corresponding 3D object
points. The original POSIT [24] method for finding the pose of an object from a single view is a combination of two algorithms: i) POS (Pose from Orthography and Scaling), which approximates the perspective projection with a scaled orthographic projection and finds the rotation matrix and the translation vector of the object by solving a linear system, and ii) POSIT (POS with ITerations), which uses in its iteration loop the approximate pose found by POS in order to compute better scaled orthographic projections of the feature points, and then applies POS to these projections instead of the original image projections. Modern POSIT uses projective geometry and does not require the origin of the object coordinate system to be one of the image points, as occurs in classical POSIT. This approach was first described in [23] and is simply an analytic formulation of the POSIT algorithm in homogeneous form. POSIT is based on the classic pinhole camera model with: i) center of projection O, ii) image plane at distance f_L (the focal length) from O, iii) axes Ox and Oy pointing along the rows and columns of the camera sensor, and a third axis Oz pointing along the optical axis (Figure 4.14). The unit vectors of these three axes are called i, j and
Fig. 4.14 POSIT perspective projection.
k. The intersection of the optical axis with the image plane is the image center C. An object with feature points M_0, M_1, ..., M_i, ..., M_n is located in the field of view of the camera. The object coordinate frame is (M_0 u, M_0 v, M_0 w). The coordinates U_i, V_i, W_i of the points M_i in this frame are known. The images of the points M_i are called m_i, and the image coordinates (x_i, y_i) of each m_i are known. The rotation matrix R and translation vector T of the object in the camera coordinate system can be expressed by a transformation matrix P (the pose matrix):
P = [P_1; P_2; P_3; H]    (4.28)

P_1 = (i_u, i_v, i_w, T_x)    (4.29)
P_2 = (j_u, j_v, j_w, T_y)    (4.30)
P_3 = (k_u, k_v, k_w, T_z)    (4.31)
H = (0, 0, 0, 1)    (4.32)
This roto-translation matrix P is in homogeneous form; therefore, to obtain the coordinates of an object point M_i in the camera coordinate system, one simply multiplies P by the coordinates of point M_i, or of the vector M_0M_i, in homogeneous form (with a fourth coordinate equal to one). The fourth column of the matrix P is the translation vector T = [T_x, T_y, T_z, 1]^T. The fundamental relations linking the row vectors P_1, P_2 of the pose matrix, the coordinates of the object vectors M_0M_i in the object coordinate system, and the coordinates x_i and y_i of the perspective images m_i of M_i are:
M_0M_i · I = x'_i    (4.33)
M_0M_i · J = y'_i    (4.34)

with

I = (f/T_z) P_1,  J = (f/T_z) P_2    (4.35)

x'_i = x_i (1 + ε_i),  y'_i = y_i (1 + ε_i),  and  ε_i = M_0M_i · P_3 / T_z - 1    (4.36)
Consider the unknown coordinates (X_i, Y_i, Z_i) of vector M_0M_i in the camera coordinate system. Then M_0M_i · P_1 = X_i and, for the same reason, M_0M_i · P_3 = Z_i, so that (1 + ε_i) = Z_i / T_z. Also, in perspective projection, the relation x_i = f X_i / Z_i holds between image point coordinates and object point coordinates in the camera coordinate system. Using these expressions in the equations above leads to identities, which proves the validity of these equations. The terms ε_i are generally unknown: they depend on P_3, which can be computed only after I and J have been computed. The coordinates x'_i = x_i(1 + ε_i) and y'_i = y_i(1 + ε_i) are the image coordinates of the object points M_i under a scaled orthographic projection model. Indeed, x_i = f X_i / Z_i can be written as x_i = f X_i / ((1 + ε_i) T_z), thus x'_i = f X_i / T_z. In other words, the image points (x'_i, y'_i) are obtained by "flattening" the object by orthographic projection of its points onto the plane z = T_z through M_0 before performing a perspective projection. To obtain estimates for I and J, one uses x_i and y_i instead of x'_i and y'_i in Eqs. (4.33)-(4.34), thereby making errors x_i ε_i and y_i ε_i which are added to the image measurement errors. Once estimates for I and J have been obtained, they can be used to find more precise values of ε_i, which in turn lead to better estimates
of I and J. The problem can therefore be solved by iteration. The steps of the iterative pose algorithm can be summarized as follows:
1. ε_i = best guess, or ε_i = 0 if no pose information is available.
2. Loop: solve for I and J in the following systems:

M_0M_i · I = x'_i,  M_0M_i · J = y'_i    (4.37)

with

x'_i = x_i (1 + ε_i),  y'_i = y_i (1 + ε_i)    (4.38)

3. From I, get:

R_1 = (I_1, I_2, I_3)    (4.39)
f / T_z = |R_1|    (4.40)
i = (T_z / f) R_1    (4.41)
P_1 = (T_z / f) I    (4.42)

Similar operations yield j and P_2 from J.
4. k = i × j,  P_3 = (k_u, k_v, k_w, T_z),  ε_i = M_0M_i · P_3 / T_z - 1.
5. If all ε_i are close enough to the ε_i of the previous loop, EXIT; else go to step 2.
At each iteration the unknowns are the four coordinates (I_1, I_2, I_3, I_4) of I, and we can write one equation for each object point M_i for which we know the image position m_i and its image coordinate x_i. One such equation has the form U_i I_1 + V_i I_2 + W_i I_3 + I_4 = x'_i, where (U_i, V_i, W_i, 1) are the four homogeneous coordinates of M_i. Considering several object points M_i yields a linear system of equations, which can be written in matrix form as A I = V_x, where A is a matrix whose i-th row vector is A_i = (U_i, V_i, W_i, 1) and V_x is a column vector whose i-th coordinate equals x'_i. The same discussion is valid for vector J. Since there are four unknown coordinates in vectors I and J, the matrix A must have at least rank 4 for the systems to provide solutions. This requirement is satisfied if the matrix has at least four rows and the object points are non-coplanar; therefore at least four non-coplanar object points and their corresponding image points are required. This requirement is easily met, since we obtain more than four non-coplanar points from our feature detection strategy, and the correspondence is established by construction. Figure 4.15 shows examples of posture evaluation based on the selected features.
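A compact sketch of the POSIT iteration summarized in steps 1-5 is given below. It is an illustrative reimplementation following the equations of this section, not the Modern POSIT code of [23]; image coordinates are assumed to be expressed relative to the image center, and the input points are assumed non-coplanar.

```python
# Classical POSIT iteration (steps 1-5 above), illustrative implementation.
import numpy as np

def posit(object_points, image_points, focal_length, n_iter=20):
    """object_points: (n, 3) coords U,V,W of M_i in the object frame (M_0 at row 0)
       image_points : (n, 2) coords (x_i, y_i) relative to the image center
       returns rotation R (3, 3) and translation T (3,)."""
    M = object_points - object_points[0]          # vectors M0Mi
    A = np.hstack([M, np.ones((len(M), 1))])      # rows (Ui, Vi, Wi, 1)
    A_pinv = np.linalg.pinv(A)
    eps = np.zeros(len(M))                        # step 1: eps_i = 0
    for _ in range(n_iter):
        x_p = image_points[:, 0] * (1 + eps)      # x'_i = x_i (1 + eps_i)
        y_p = image_points[:, 1] * (1 + eps)
        I = A_pinv @ x_p                          # step 2: solve A I = Vx, A J = Vy
        J = A_pinv @ y_p
        s1, s2 = np.linalg.norm(I[:3]), np.linalg.norm(J[:3])
        f_over_Tz = np.sqrt(s1 * s2)              # step 3: scale f/Tz
        i_vec, j_vec = I[:3] / s1, J[:3] / s2
        k_vec = np.cross(i_vec, j_vec)            # step 4
        Tz = focal_length / f_over_Tz
        eps_new = (M @ k_vec) / Tz                # M0Mi . P3 / Tz - 1 reduces to M0Mi . k / Tz
        if np.allclose(eps_new, eps, atol=1e-6):  # step 5: convergence test
            eps = eps_new
            break
        eps = eps_new
    R = np.vstack([i_vec, j_vec, k_vec])
    T = np.array([I[3] / f_over_Tz, J[3] / f_over_Tz, Tz])
    return R, T
```

In our setting, this routine would be called with (at least four non-coplanar) detected facial feature points and the corresponding vertices of the neutral 3D model, yielding the rough pose used as the starting point of the next step.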
4.6.2 Shape and expression inference
The initial posture inference suffers from the fact that the feature points used to detect the posture generally depend on the shape and expression of the 3D model, while the model used in the posture evaluation is considered in its neutral state. Nevertheless, this is only an initial rough estimate. Assuming the inferred posture is not far from the real one except for the model's shape and expression, then, using a constrained
Fig. 4.15 Examples of posture evaluation with POSIT on BioID database [25]. The black circles represent the used feature points.
optimization algorithm [26], we compute the model’s shape σ and expression parameters α that minimize the error between the points p on the model and the ones obtained with previous posture estimation step. The sequential quadratic programming (SQP) algorithm is a generalization of Newton’s method for unconstrained optimization that attempts to solve a nonlinear program directly rather than convert it to a sequence of unconstrained minimization problems. We define the error to be minimizes as follows: n
Err(σ, α, t, R) = ∑_{i=1}^{n} ‖ w_i ((g_i(σ, α) · R + t) − p_i) ‖²      (4.43)
where p_i is the model point corresponding to the i-th chosen fiducial point, n is the number of detected points, and the function g_i maps the i-th point according to the shape parameters σ and expression parameters α. The error is weighted (through the function w) across the model's points according to the intrinsic uncertainty of each point selection⁸. The general constrained optimization problem is to minimize a nonlinear function subject to nonlinear constraints; for the sake of conciseness, let γ = [σ, α, t, R]:

min Err(γ)
s.t. c_i(γ) ≤ 0, i ∈ τ
     c_i(γ) = 0, i ∈ ε      (4.44)
where each c_i is a mapping from R^n to R, and τ and ε are index sets for the inequality and equality constraints, respectively. An SQP method uses a quadratic model of the objective function and a linear model of the constraints. The objective function is therefore replaced with the quadratic approximation:

q_k(d) = ∇f(γ_k)^T d + (1/2) d^T ∇²_{γγ} L(γ_k, λ_k) d      (4.45)
where L(γ_k, λ_k) is the Lagrangian function used to express the first-order (necessary) and second-order (sufficient) conditions for a local minimizer.

⁸ Sometimes the selection of the eyebrow vertexes is difficult because of the hair.
The constraint functions are replaced by linear approximations, so that the minimization problem becomes:

min q_k(d)
s.t. c_i(γ_k) + ∇c_i(γ_k)^T d ≤ 0, i ∈ τ
     c_i(γ_k) + ∇c_i(γ_k)^T d = 0, i ∈ ε      (4.46)
The step d_k is calculated by solving this quadratic subprogram. If the starting point γ_0 is sufficiently close to the local minimizer γ*, and the Lagrange multiplier estimates λ_k remain sufficiently close to λ* (the Lagrange multipliers associated with γ*), then the sequence generated by setting γ_{k+1} = γ_k + d_k converges to γ* at a second-order rate. To build q_k, the Lagrange multiplier estimates are needed. Most approaches estimate the Hessian of L with the Broyden-Fletcher-Goldfarb-Shanno (BFGS) approximation, which is updated at each iteration. Furthermore, in many cases the convergence properties of the basic SQP algorithm are improved by using a line search strategy. As described, the SQP approach permits the definition of a set of constraints for every parameter used in the minimization; this prevents wrong convergence to an inconsistent face. The result of our SQP approach for shape and expression morphing is a more reliable face model, with shape and expression close to the real face present in the scene.
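The chapter's solver follows Powell's SQP method [26]; as a hedged stand-in, the sketch below uses SciPy's SLSQP routine (an SQP implementation) with box bounds in place of the general constraints of Eq. 4.44. The model callback g, the weights and the bounds are placeholders, not the chapter's actual components.

```python
import numpy as np
from scipy.optimize import minimize

def fit_shape_expression(g, p, w, n_sigma, n_alpha, bounds):
    """Minimize Eq. 4.43 over gamma = [sigma, alpha, t, R] under box constraints.

    g      : callable(sigma, alpha) -> (n, 3) model points for the fiducials
    p      : (n, 3) target points from the posture estimation step
    w      : (n,) per-point weights encoding the selection uncertainty
    bounds : per-parameter (low, high) pairs preventing inconsistent faces
    """
    def unpack(gamma):
        sigma = gamma[:n_sigma]
        alpha = gamma[n_sigma:n_sigma + n_alpha]
        t = gamma[n_sigma + n_alpha:n_sigma + n_alpha + 3]
        rx, ry, rz = gamma[-3:]                   # Euler angles parameterizing R
        Rx = np.array([[1, 0, 0], [0, np.cos(rx), -np.sin(rx)], [0, np.sin(rx), np.cos(rx)]])
        Ry = np.array([[np.cos(ry), 0, np.sin(ry)], [0, 1, 0], [-np.sin(ry), 0, np.cos(ry)]])
        Rz = np.array([[np.cos(rz), -np.sin(rz), 0], [np.sin(rz), np.cos(rz), 0], [0, 0, 1]])
        return sigma, alpha, t, Rz @ Ry @ Rx

    def err(gamma):                               # Eq. 4.43
        sigma, alpha, t, R = unpack(gamma)
        residuals = (g(sigma, alpha) @ R.T + t) - p
        return np.sum((w[:, None] * residuals) ** 2)

    gamma0 = np.zeros(n_sigma + n_alpha + 6)      # neutral face, no motion
    res = minimize(err, gamma0, method="SLSQP", bounds=bounds)
    return unpack(res.x)
```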
Fig. 4.16 Examples of shape and expression evaluation after POSIT on the UMIST and BioID databases. The black circles represent the feature points used. The blue mask is the mask after shape adaptation, while the green one is the mask after the initial POSIT estimation.
The precision obtained may not be sufficient. Figure 4.16 shows some examples of facial adaptation after posture estimation. In general, the greater the number of detected features, the greater the precision of the results. The SQP approach can estimate shape and expression only where possible: if the information is not available (no feature points, or too few feature points), the
information cannot be obtained. For instance, if we have no information about the mouth corners, SQP cannot obtain any shape or expression parameter that morphs the mouth. This is always the case for the nose, since no feature points related to the nose are extracted by our method (i.e. the noses in Figure 4.16 are not adapted); the nose adaptation is performed entirely by the model refinement. Summarizing, the advantage of having a further 3D model adaptation strategy lies in the fact that, even without feature point information, an estimation is still possible.
4.6.3 3D Tracking-based Model refinement

The goal is to refine the posture and morphological estimation in order to obtain the best-fitted mask over the current subject. As already mentioned, the 3D model refinement is composed of two processes, "initial refinement" and "depth refinement", and is therefore not restricted to what can be inferred from a single frame; the overall refining process is thus not confined to the initialization procedure. More in detail, the initial refinement, used to improve the quality of the mask initialization, is performed once at initialization time, while the depth refinement, which aims to obtain a better fitting along the depth dimension, is performed continuously while tracking. In this section we present both the initialization-oriented model refinement and the depth-oriented progressive refinement.
4.6.4 Initial refinement

In our previous work we computed the initialization step (precise face posture, shape and expression) using several points selected by hand and directly applying the posture and shape-expression inference strategy described above (POSIT+SQP). This approach cannot be used with automatically selected features, for two reasons: i) the difficulty of detecting the points automatically with the high precision required; ii) the large number of points required to obtain a precise fitting. In fact, as already mentioned, in case of a lack of feature points (especially when this lack is concentrated on a particular macro-feature such as the mouth), the pose+SQP approach is unable to completely estimate the shape. Furthermore, in real applications the precision of the detected feature points is not comparable with that of hand-picked ones, so the posture, shape and expression estimated directly from the automatically selected feature points are not sufficient (see Section 4.6.2). For these reasons, the posture, shape and expression inferred from the automatic feature point selection process are considered a rough initialization; our refinement becomes more useful in the absence of information, or where this information is noisy. To refine the quality of the face localization, we use an extended version of our tracking algorithm that directly includes the parametric texture properties of the model appearance.
In terms of appearance, in fact, the texture of the subject cannot be extracted using the rough initialization parameters: considering the fitting results after the POSIT and SQP steps, if we try to extract the texture by back-projecting the 3D structure onto the image plane, we obtain a distorted result that includes background and other artifacts. On the other hand, to perform tracking a face template is needed. The idea is to consider this template as a linear combination of model texture parameters; the template appearance is therefore treated as an initial unknown and included among the variables to be minimized. Summarizing, we include morphological and appearance bases in our minimization technique, obtaining a method to refine a rough localization. Our appearance basis includes the average face T_avg and the five principal appearance components. As in the eigenface approach, a linear combination of this appearance basis can produce a given appearance. Following this idea, we can introduce the appearance parameters into our tracking algorithm using a linear appearance strategy. Our goal is therefore to obtain the posture, the SU estimation parameters and the EU deformation parameters in a single minimization process between one frame and an average face template T_avg plus the linear combination of appearance bases (which we call T_avg+app) that produces the most accurate approximation of the face. We rely on the idea that the 2D face template T_avg+app(x) will appear in the actual frame I(x), albeit warped by W(x; p), where p = (p_1, ..., p_n, α_1, ..., α_m) is the vector of parameters of the 3D face model, with m shape-expression parameters (SU and EU), and x are pixel coordinates in the image plane. Thanks to this assumption, we can obtain the movement and shape-expression parameters p by minimizing function (4.20). If T_avg+app(x) is the template with the correct pose, morphology and expression p, and I(x) is the actual frame, assuming that the image does not differ too much from the reconstructed T_avg+app and that the preliminary posture detection has satisfactory precision, the correct pose, shape and expression p can be obtained by minimizing the sum of the squared errors between T_avg+app and I(W(x; p)):

∑_x [I(W(x; p)) − T_avg+app(x)]²      (4.47)
This minimization is at the base of our tracking process. In this formulation we express directly the dependency on the appearance reconstruction of our template T. To include the estimation of the eigenface parameters in our minimization algorithm, we consider the problem as a linear appearance model: T_avg+app becomes just the average face T_avg, and the appearance parameters are moved into the appearance factor, as is done for the illumination basis:

∑_x [I(W(x; p)) − T_avg(x) − ∑_{i=1}^{m} λ_i A_i(x) − ∑_{j=1}^{n} σ_j App_j(x)]²      (4.48)
where A_i, with i = 1, ..., m, and App_j, with j = 1, ..., n, are sets of known appearance variation images, and λ_i and σ_j, with i = 1, ..., m and j = 1, ..., n, are the appearance parameters. Thanks to Linear Appearance Variation techniques, this function can also be minimized using a Lucas-Kanade-like approach.
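Within a Lucas-Kanade-style iteration, Eq. 4.48 is linear in λ and σ once the warp parameters p are fixed, so those coefficients can be refreshed with an ordinary least-squares solve. A minimal sketch assuming NumPy; the image sampling I(W(x; p)) and the warp update itself are omitted, and all array names are illustrative.

```python
import numpy as np

def update_appearance(warped, T_avg, A_basis, App_basis):
    """Solve Eq. 4.48 for (lambda, sigma) with the warp parameters p held fixed.

    warped    : (P,) pixels of I(W(x; p)) sampled on the template grid
    T_avg     : (P,) average face template
    A_basis   : (P, m) illumination variation images A_i
    App_basis : (P, n) appearance variation images App_j
    """
    B = np.hstack([A_basis, App_basis])          # stacked linear basis
    residual = warped - T_avg                    # what the basis must explain
    coeffs, *_ = np.linalg.lstsq(B, residual, rcond=None)
    lam = coeffs[:A_basis.shape[1]]              # illumination parameters
    sig = coeffs[A_basis.shape[1]:]              # appearance parameters
    error = residual - B @ coeffs                # remaining error drives the
    return lam, sig, error                       # Lucas-Kanade warp update
```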
Fig. 4.17 Comparison between the initialization with the preliminary estimation (yellow mask) and after the tracking refinement (solid black line) on the MMI database collected by M. Pantic and M.F. Valstar [10] and on the Cohn-Kanade database [27]. The red squares mark the initial 5 points.
Fig. 4.18 Comparison between the initialization with the preliminary estimation (yellow mask) and after the tracking refinement (solid black line) on the Cohn-Kanade database [27]. The red squares mark the initial 5 points.
The proposed technique works well under the constraint that the initial rough initialization is not far from the real posture, at least in terms of translation and tilting. Furthermore, to achieve convergence, the frame image needs to be blurred in order to eliminate the initial difference between the real face and the reconstructed one. Figures 4.17 and 4.18 show some examples of refinement starting from the worst possible condition, i.e. only 5 feature points. The results are nevertheless impressive.
4.6.5 Depth refinement

The idea behind the progressive facial shape depth refinement is that the real 3D shape cannot be captured from a single image, and therefore cannot
be captured at initialization time. This process is, by construction, used at tracking time. In theory, a pair of images taken with a stereo camera, or a pair of images with correspondences between several fiducial points, is needed for 3D reconstruction. To obtain the depth adaptation we use the depth-oriented SUs. This depth shape adaptation is obtained using a time-variant approach to shape refinement: in other words, the tracking process morphs the 3D face template in accordance not only with the expression basis (EU) but also with the shape one (SU). Shape morphing while tracking is used until the adaptation is completed. In general, the adaptation is performed only when the posture relative to the camera permits some information about the depth dimension to be inferred; the progressive refinement process is therefore triggered by posture evaluation while tracking. Figure 4.19 shows two frames during tracking (with and without shape depth adaptation); note that the posture permits the depth dimension to be inferred.
(a) Without shape morphing
(b) With shape morphing
Fig. 4.19 Shape depth adaptation (SU 7, 8, 16, 17, 18) while tracking.
The depth adaptation permits a better fitting while tracking, since the more precise the model is, the more reliable the tracking will be. Nevertheless, our experience has shown that an acceptable tracking can still be achieved even without using the depth adaptation, thanks to our specific "template management" approach. Figure 4.20 shows another example of tracking with and without 3D depth adaptation. Even in this noisy situation, both the normal tracking and the depth-adapted one achieve good quality. Of course, the estimation precision of the depth-adjusted one is greater than that of the non-adapted one.
(a) Without shape morphing
(b) With shape morphing
Fig. 4.20 Shape depth adaptation while tracking in a noisy environment. Note also the strong light source that complicates the tracking and adjustment process.
4.7 Experimental results

We have conducted several separate tests to evaluate our approach. Given the large number of databases used, we summarize their attributes and characteristics in Table 4.3.

Database          ID    Characteristic
BioID             (1)   Image sequence database [25]. Not very far from frontal pose, with moderate differences in expression and illumination.
MMI               (2)   Video-based facial expression database collected by M. Pantic and M.F. Valstar [10], with frontal and statically positioned faces. Moderate and localized changes in expression.
Cohn-Kanade       (3)   Image/video-based database [27] in gray scale, with frontal faces and greater changes in expression.
Hammal-Caplier    (4)   Image/video database [11] with frontal view and complex ecological expressions.
MEED              (5)   Long video-based database recorded during an emotion-oriented experiment [22] in front of a PC.

Table 4.3 Testing databases with their main characteristics.
• BioID was recorded with special emphasis on "real world" conditions; the test set therefore features a large variety of illumination, background and face sizes. The dataset consists of 1521 gray-level images with a resolution of 384x286 pixels. Each one shows the frontal view of the face of one of 23 different test persons. BioID is used for feature localization, since it is already labelled with the eye positions and other feature positions.
• The MMI Facial Expression database holds over 2000 videos and over 500 images of about 50 subjects displaying various facial expressions on command. The database is provided with a FACS labelling for expression recognition, as is the Cohn-Kanade one.
• The Cohn-Kanade database consists of approximately 500 image sequences from 100 subjects, taken with a video camera. Subjects range in age from 18 to 30 years; sixty-five percent were female, 15 percent were African-American and three percent Asian or Latino. Many videos have an incorrect white balance. The database is provided with a FACS labelling for expression recognition; in particular, subjects were instructed by an experimenter to perform a series of facial displays that included single action units (e.g., AU 12, or lip corners pulled obliquely) and action unit combinations (e.g., AU 1+2, or inner and outer brows raised). Each sequence begins from a neutral or nearly neutral face.
• Hammal-Caplier holds about 50 short image sequences from 16 subjects, in which complex ecological expressions are presented. The main expressions presented are disgust, joy and fear.
• The MEED database is a multidimensional ecological database collected while subjects sit in front of a PC playing emotion-oriented games. It holds over 30 long videos (20 minutes each) of about 30 subjects. The database is manually labelled using FACS by experts. There are several changes in illumination, due to the PC monitor in front of the subject and to sunlight coming from a window.
For the rest of the chapter we refer to each database using the ID in Table 4.3 instead of its name. Based on this set of carefully selected databases, we built our experiments. These experiments are aimed at assessing the quality of the initial mask fitting in comparison with a manual mask fitting. For this reason we present the results in terms of the distance between the automatically fitted mask and the reference manually fitted one, both before and after the 3D refinement. The results are presented considering the maximum number of features that our algorithm is able to detect (16 features: four for each eye, two per eyebrow and four for the mouth). Our evaluation is based on the distance of the located features from the real features, in terms of the Euclidean distance d_i between the i-th feature and the reference one obtained manually. This error distance is defined analogously to [28]:

D_f = (1/(n e)) ∑_{i=1}^{n} d_i      (4.49)
where e is the ground-truth inter-ocular distance between the left and right eye pupils, and n is the number of features taken into account. For testing the model initialization procedure, we compare a manually fitted mask (a reference mask obtained by placing 28 fiducial points, as in our previous work) with the automatically obtained one. In this case our evaluation is based on the distance from the real posture and shape, D_m, computed as in Equation 4.49 but considering the vertexes of the projected 3D mask obtained with the automatic process and those of the reference mask obtained by manual positioning, instead of the feature points.
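As an illustration, Eq. 4.49 translates directly into a few lines of NumPy; the same routine yields D_m when mask vertexes are passed instead of feature points. Array names are illustrative.

```python
import numpy as np

def normalized_feature_distance(detected, reference, pupil_left, pupil_right):
    """Eq. 4.49: mean point-to-point error normalized by the inter-ocular distance e."""
    d = np.linalg.norm(detected - reference, axis=1)   # per-point distances d_i
    e = np.linalg.norm(pupil_left - pupil_right)       # ground-truth inter-ocular distance
    return d.mean() / e                                # D_f = (1 / (n e)) * sum_i d_i
```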
Since the 3D refinement process is used to overcome the problems of noise in the feature point detection and of missing feature points, we present different tests to evaluate the response of our system in the presence of both types of problem. First of all, we need to evaluate the quality of the reference mask with respect to the noise that can be introduced by the manual selection. The reference mask is in fact obtained by manual fitting of the feature points followed by the POSIT and SQP approach. This approach intrinsically tolerates a small amount of noise; nevertheless, we performed some tests to evaluate it. To estimate the sensitivity of this method to errors in the accuracy of the point selection, we add a noise component. In the face processing field, noise has traditionally been modeled as imprecision due to the incomplete representation of the face data variance [29], which in turn is due to the reduction (e.g., by PCA) of the set of sampled points. Here, we adopt a noise model specifically aimed at face processing rather than acquisition: namely, we model noise as a random displacement (x', y'), where for x' and y' we adopt two independent random distributions with mean 0 and variance equal to 5% of the face's width or height expressed in pixels. We add this noise to the selected points, perform 100 random perturbations of the initial points, and then observe the resulting reference face mask after minimization. Measuring the distance defined in Equation 4.49 between each perturbed mask and the initial one (without perturbation), 98% of the vertexes fall within 5% of the inter-ocular distance e. With this noise model, the reference mask adaptation has thus shown great robustness with respect to noisy coordinates of the initially picked points. After evaluating the quality of the reference mask, we analyze in more detail the results of our automatic adaptation strategy, presenting the experimental results for two types of experiment: i) full-set experiments, which indicate the quality of the results when all 16 points are detected by our automatic process; ii) minimum-feature experiments, using the minimum number of feature points (5 points). For both types we present the results before the 3D mask refinement (using only POSIT+SQP, PSQP) and after the 3D refinement (3DR). We decided to test the worst case, with as few points as possible, since, as already mentioned, in many cases not every point can be detected even though our feature detection algorithm is able to detect at most 16 points. The lack of points forces our tracking system to morph the model more heavily than in the cases where an initial estimation was performed; this is equivalent to a crash-test situation. Table 4.4 summarizes the results for both cases (5 and 16 points); the reported error refers to the detected face (the better of the two detection strategies is used).

DB    3DR average 5    PSQP average 5    3DR average 16    PSQP average 16
(1)   0.0773           0.1061            0.0539            0.0831
(2)   0.0644           0.0883            0.0452            0.0655
(3)   0.0490           0.0737            0.0426            0.0531
(4)   0.0466           0.0602            0.0428            0.0566

Table 4.4 Testing results for model initialization. The average distance D_m between the estimated mask and the reference one is presented for both the 5-point and the 16-point case.
From the results in Table 4.4 we can note that the final improvement due to our 3D technique has a greater impact on the 5-point-based approach than on the 16-point-based one, although an improvement exists for the 16-point-based approach as well. Some differences in the results depend strictly on the characteristics of the database under analysis. In fact, the results on databases (4) and (3) are already impressive even with the 5-point-based initialization; the reason is that these databases are quite simple, with frontal views and neutral initial expressions. In the case of database (1), the initial frame can include illumination variations, posture changes and moderate facial expressions; the results for database (1) therefore require more feature points to be comparable with the others (the initial PSQP quality influences the final results more in complex situations than in simpler ones). Concerning database (2), the initial frame used for the mask fitting presents a situation similar to database (1), except that the posture is closer to a frontal view, so the results are better. In some cases the fitting procedure has to deal with complex illumination situations and with the presence of external objects such as glasses. Figure 4.21 shows some examples from database (5) where these factors clearly increase the imprecision of the refinement approach.
(a)
(b)
(c)
Fig. 4.21 Some examples of complex initialization situations that produce imprecision in the mask initialization. The black dotted line is the mask before refinement, while the red solid line is the mask after the refinement.
In more detail, sub-figures (a) and (b) of Figure 4.21 present some issues related to lateral light, which produces imprecision in the face width. In case (b), the presence of glasses reflecting the monitor constitutes another source of probable errors, which in this case is overcome by treating the effect as an external occlusion (high residual error). In sub-figure (c) of Figure 4.21, the light localized on the eyebrows produces a wrong eyebrow estimation; in this case, unfortunately, the effect is not treated as an occlusion (too little residual error) and cannot be managed by our light variation management. Figure 4.22 shows a comparison of the cumulative error distributions of the proposed PSQP and 3DR methods in the case of 16 points, for databases (3) and (1). Comparing our results with those reported in [28] for database (1), we obtain comparable results, with 97% of the face vertexes within 10% of the inter-ocular separation.
(a) Cohn-Kanade
(b) BioID
Fig. 4.22 Cumulative distribution of the projected vertex-to-vertex measure D_m using databases (1) and (3). A comparison between 3DR (solid) and PSQP (dotted) is presented.
Considering the fact that we compute the values for each of the 113 vertexes, instead of computing them for only 17 feature points, this result exceeds the quality obtained in [28] for database (1). One of the main reasons for this result is the use of the 3D deformable model as reference instead of the feature points: in this way, as already discussed, the intrinsic imprecision of manually picked feature points is attenuated.
Fig. 4.23 Initial mask fitting results. On the left, POSIT (green) and PSQP (blue); on the right, PSQP (blue) and 3DR (red).
Figure 4.23 shows the fitting results localized on the mouth and nose macro-features. In particular, the refinement obtained over the nose macro-feature is evident. As already mentioned, this refinement can be obtained only with 3DR; nevertheless, the mouth region is also refined by 3DR, obtaining a more precise lip fitting. In conclusion, we have shown that our fully automatic mask fitting process improves the results of PSQP and makes it possible to obtain a good fitting mask even when few fiducial points are available.
References

1. Rydfalk, M.: Candide, a parameterized face. Report no. LiTH-ISY-I-866, Dept. of Electrical Engineering, Linköping University (1987)
2. Bellandi, V., Anisetti, M., Beverina, F.: Upper-face expression features extraction system for video sequences. Proc. of the International Conference on Visualization, Imaging, and Image Processing (VIIP05), pp. 83–88 (2005)
3. Damiani, E., Anisetti, M., Bellandi, V., Beverina, F.: Facial identification problem: A tracking based approach. Proc. of the IEEE International Symposium on Signal-Image Technology and Internet-Based Systems (IEEE SITIS05) (2005)
4. Ekman, P., Friesen, W.: Facial Action Coding System: A technique for the measurement of facial movement. Consulting Psychologists Press (1978)
5. Turk, M., Pentland, A.: Face recognition using eigenfaces. Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition, pp. 586–591 (1991)
6. Cootes, T.F., Edwards, G.J., Taylor, C.J.: Active appearance models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(6), 681–685 (2000)
7. Blanz, V., Vetter, T.: A morphable model for the synthesis of 3D faces. Proc. of Computer Graphics, Annual Conference Series (SIGGRAPH) (1999)
8. Beverina, F., Palmas, G., Anisetti, M., Bellandi, V.: Tracking based face identification: A way to manage occlusions, and illumination, posture and expression changes. Proc. of the IEE 2nd International Conference on Intelligent Environments (IE06) (2006)
9. Gross, R., Matthews, I., Baker, S.: Active appearance models with occlusion. Image and Vision Computing, 24 (2006)
10. Pantic, M., Valstar, M.F., Rademaker, R., Maat, L.: Web-based database for facial expression analysis. Proc. of the IEEE Int'l Conf. on Multimedia and Expo (ICME'05), Amsterdam, The Netherlands (2005)
11. Hammal, Z., Caplier, A., Rombaut, M.: A fusion process based on belief theory for classification of facial basic emotions. Proc. of Fusion 2005, the 8th International Conference on Information Fusion (ISIF 2005) (2005)
12. Martinkauppi, B.: Face colour under varying illumination - analysis and applications. PhD thesis, Faculty of Technology, University of Oulu (2002)
13. Shashua, A.: Geometry and Photometry in 3D Visual Recognition. PhD thesis, Massachusetts Institute of Technology (1992)
14. La Cascia, M., Sclaroff, S., Athitsos, V.: Fast, reliable head tracking under varying illumination: An approach based on registration of texture-mapped 3D models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(4), 322–336 (2000)
15. Hager, G.D., Belhumeur, P.N.: Efficient region tracking with parametric models of geometry and illumination. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(10) (1998)
16. Lambert, J.: Photometria Sive de Mensura et Gradibus Luminis, Colorum et Umbrae. Eberhard Klett (1760)
17. Basri, R., Jacobs, D.: Lambertian reflectance and linear subspaces. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(2), 383–390 (2003)
18. Eisert, P.: Very Low Bit-Rate Video Coding Using 3-D Models. PhD thesis, Band 20, Kommunikations- und Informationstechnik, Herausgeber B. Girod und J. Huber, Shaker Verlag, ISBN 3-8265-8308-6, Aachen (2001)
19. Baker, S., Matthews, I.: Lucas-Kanade 20 years on: A unifying framework. International Journal of Computer Vision, 56(3), 221–255 (2004)
20. Anisetti, M., Bellandi, V., Beverina, F.: Accurate 3D model based face tracking for facial expression recognition. Proc. of the International Conference on Visualization, Imaging, and Image Processing (VIIP05), pp. 93–98 (2005)
21. Anisetti, M., Bellandi, V., Damiani, E.: 3D expressive face model-based tracking algorithm. Proc. of the third Signal Processing, Pattern Recognition, and Applications conference (SPPRA) (2006)
22. Ciceri, M.R., Balzarotti, S., Beverina, F., Manzoni, F., Piccini, L.: MEED: the challenge towards a multidimensional ecological emotion database. Proc. of the LREC 2006 Workshop on Corpora for Research on Emotion and Affect (2006)
23. DeMenthon, D., Davis, L.S.: Recognition and tracking of 3D objects by 1D search. Proc. of the Image Understanding Workshop, pp. 653–659 (1993)
24. DeMenthon, D., Davis, L.S.: Model-based object pose in 25 lines of code. International Journal of Computer Vision, 15(1), 123–141 (1995)
25. The BioID face database (2001)
26. Powell, M.J.D.: A fast algorithm for nonlinearly constrained optimization calculations. In: G.A. Watson (ed.), Numerical Analysis. Springer Verlag (1978)
27. Kanade, T., Cohn, J., Tian, Y.: Comprehensive database for facial expression analysis. Proc. of the 4th IEEE International Conference on Automatic Face and Gesture Recognition (FG'00), pp. 46–53, Grenoble, France (2000)
28. Cristinacce, D., Cootes, T.: Feature detection and tracking with constrained local models. Proc. of the British Machine Vision Conference (2006)
29. Zhao, W., Chellappa, R.: Face Processing: Advanced Modeling and Methods. Elsevier Academic Publisher (2002)
Chapter 5
Input Devices and Interaction Techniques for VR-Enhanced Medicine Luigi Gallo and Giuseppe De Pietro
Summary. Virtual Reality (VR) technologies make it possible to reproduce real life events faithfully in computer-generated scenarios. This approach has the potential to simplify the way people solve problems, since they can take advantage of their real life experiences while interacting in synthetic worlds. In medicine, the application of these technologies and of the related communication interfaces could have a great impact on several fields, such as virtual endoscopy, surgical simulation and planning, and medical education. Nonetheless, VR is still far from being used in daily clinical practice, being confined to specialist applications. In this study we try to outline the deficiencies of current VR-enhanced medical applications, focusing on the field of medical imaging. We analyze the main requirements for producing effective systems suitable for use by physicians, from the input device to the interaction techniques and metaphors. Moreover, we introduce the interactive system we are designing to allow a usable manipulation of 3D reconstructions of anatomical parts in virtual environments, which is based on the use of a handheld input device: the Wii controller. Key words: Virtual Reality, Medicine, Interaction Metaphors
Luigi Gallo
ICAR-CNR, Via Pietro Castellino 111, 80131 Naples, Italy, e-mail: [email protected]
University of Naples "Parthenope", Via A. F. Acton 38, 80133 Naples, Italy, e-mail: [email protected]
Giuseppe De Pietro
ICAR-CNR, Via Pietro Castellino 111, 80131 Naples, Italy, e-mail: [email protected]
5.1 Introduction Nowadays, immersive Virtual Environments (VEs) are commonly used in several fields. The benefits of immersion, a large field of view and information-rich visualization, have been widely researched, and the increased availability of new low-cost and powerful stereoscopic display systems makes it possible for everyone to fully exploit virtual experiences. Nevertheless, many visually compelling VEs are still unsuitable for the solution of real-world problems, mainly because they are still too difficult to use. VR is not only a display technology, it is also a communication interface based on interactive 3D visualization. To make the most of VEs, users should be able to interact easily with virtual objects. The usability constraints are exacerbated in the medical scenario. Ten years ago, Zajtchuk and Satava in [34] surveyed the state of the art of medical applications of virtual reality. They highlighted four main areas: education and training; medical disaster planning and casualty care; virtual prototyping and rehabilitation and psychiatric therapy. In particular, they emphasized the opportunity of using VR technologies widely in the field of medical education, since in this field it is possible to build on the long history of simulated environments for aviation training. More recently, Imielinska and Molholt in [17] and Lin Yang et al. in [32] have analyzed the application of 3D visualization in anatomy teaching. In both studies, the authors emphasize how, after an initial enthusiastic response, only a few medical institutions remain committed to the implementation of the original vision. Since the 3D is preparatory to the immersive visualization, we can conclude that something has gone wrong in the transition from the 2D to the 3D visualization in a virtual environment. In [10], Brooks identified 3D interaction as one of the most important research areas to be examined in order to speed up the adoption of VR. Our idea is that, after almost a decade, 3D interaction is still the missing link. This consideration starts from a study of success stories in the application of VR technologies. In aviation training, for example, the use of VR technology is now a common technique. In this environment, the cockpit of the plane is accurately replicated so that pilots can interact in the same way that they would have done in the real plane. In the medical training, we are far from this result. Can a physician interact with an anatomical part in the same way as he does in the real world? To allow a natural interaction, the user should be provided with haptic devices, effective interaction techniques and powerful workstations to allow a realtime visualization of the anatomical parts. Additionally, if several research groups are involved in the design of futuristic interactive medical systems, we are forced to wonder how much immersion is enough. Do we need all the senses involved to allow a usable interaction with medical data? As Bowman et al. pointed out in [8], the key word in the development of usable 3D user interfaces is specificity. In this work, the authors argue that all of the existing interaction techniques exhibit a form of overgenerality. On the contrary, they have proposed five types of specificity that researchers should consider in their design:
application - domain - task - device - user. In simple terms, this means that the interaction techniques should be designed by considering specifically the group of users (and therefore their skills), the tasks they have to perform, the particular application and the input device used. The aim is to allow users to exploit their real world skills to interact in a specific virtual environment, so to minimize the learning process of the rules of interaction. According to this vision, the choice of the application area and the consequent analysis of the interaction requirements play a primary role. In this study, we focus on the field of medical imaging. In this area, 3D visualization is going to be widely used due to the availability of new multi-detector scanners. The traditional way of reviewing images slice-by-slice is too cumbersome to interpret the considerable number of images that can be routinely acquired. Therefore the inspection of 3D reconstructions of anatomical parts is becoming an inescapable necessity. Additionally, the immersive visualization of the information-rich 3D objects can provide clinicians with a clear perception of depth and shape, so allowing a more natural interaction. On the other hand, there are other constraints, related to the burdensome problems caused by the VR equipment, which must be taken into account when introducing VR technologies in daily clinical practice. Physicians need to inspect medical data several times a day and before and after performing other tasks. To guarantee freedom of movement, handheld input devices should be preferred to more powerful but also bulkier ones. The proposal we will describe in this study is the use of a semi-immersive visualization system together with a handheld off-the-shelf device, the Wiimote, also known as Wii Remote.
5.2 Related Works In recent years, many studies have been carried out on the design of usable 3D user interfaces for medical applications. To our knowledge, [16] is the first one in which the requirements for the introduction of VR technologies in clinical practice have been taken into account. In this work, the authors outlined how the requirements in the realm of medicine are quite different from those in the traditional areas, and how this difference has to affect the design of suitable interaction devices and systems. Moreover, they were the first to propose applying VR technologies and paradigms in daily clinical practice, so allowing the inspection not only of pregenerated anatomical models, but also of the medical data of individual patients. They also stressed the importance of the intuitiveness and convenience of the human-computer interface as a main criterion for its acceptance by physicians. Nowadays the availability of new powerful graphic systems allows us to overcome most of the problems that the authors outlined in their study. However, the issues they highlighted concerning the man-machine interaction are still topical.
Most Medical Imaging software applications, which support medical experts in the 3D reconstruction process of anatomical structures coming from DICOM images, are capable of offering a three-dimensional visualization of all volumetric data with a high degree of accuracy [26, 28, 27]. However, they do not provide either stereoscopic visualization techniques (except the anaglyphic) or 3D interaction techniques required for the manipulation of 3D objects. Mainly, the interaction with 3D objects is performed by using traditional devices (mouse, keyboard) and 2D based metaphors (drag and drop, WIMP). For instance, in [27] it is possible to use a Programmable Multifunction Jog Wheel. Its main function is to allow an easier control of all the parameters needed to guide the visualization of the 3D object. However, this device is equipped with an abundance of buttons, so it requires a considerable period of time to understand how to use it. It can be considered as an enhanced mouse but it is not equipped to provide a real 3D interaction. In [20] the author presents a VR-based visualization tool. The main feature of this system is the multimodality of the interface, since the user can interact combining voice and two-handed input (gloves). The aim is to allow users to take advantage of the communication skills that they have had a lifetime to acquire. However, the speech recognition often breaks down due to environmental noise causing erroneous operations. The idea of taking advantage of well-known communication skills is shared also by [7]. In this work, the authors propose a hybrid 2D-3D interface to explore the medical dataset. They argue that 2D actions can be performed with relatively high degree of precision, whereas 3D actions are executed at high speeds in specific task situations. Therefore it is a good idea to use both. They have developed themselves this type of input device, called eye of Ra, a mixture of a flying mouse and a tablet PC pen. A drawback of this system could be that the input device is not suitable for collaborative tasks between physicians, which are indeed very common in medical data inspection. Other interesting ongoing studies are those based on the use of haptic devices. In [24], for instance, the authors propose a multimodal VR system for medical education they call touch simulator. In more detail, the touch simulator has been realized as an interactive neuro-anatomical training simulator. The user can visualize and manipulate graphical information about the brain by a finger-touch on a brain-like shaped tangible object. Notwithstanding the degree of realism that a system like this could achieve, it is probably unsuitable for use in daily clinical practice. In fact, it is too cumbersome and, since the interaction metaphor is based on the use of tangible objects, is impracticable to be used to interact with generic anatomical parts. In [1], the authors evaluate the benefits and user acceptance of a multimodal interface in which the user interacts with a game-like interactive virtual reality application. It is important, from our point of view, that in this work the authors stress the importance of using non-obstructive interfaces. They want the user to be immersed and equipped with a natural interface without resorting to complex and obstructive hardware, such as Head Mounted Displays (HMDs) or data gloves. As previously stated, we have chosen to use a handheld input device to allow the 3D interaction with the medical data, the Wiimote. We have integrated this device
in a medical application (for more details, see [13]), where it is used to point and to manipulate 3D reconstructions of anatomical parts. Recently, Hansen et al. [15] have presented a system that allows the intraoperative modification of resection plans by using the Wiimote. It is worth noting that the authors use the device wrapped in a sterile plastic hull. This is a key aspect since conventional interaction devices (like the mouse or keyboard) or more advanced VR input devices are difficult to sterilize without reducing functionality.
5.3 Requirements Analysis The interaction techniques and the devices designed for a generic virtual reality application are not really suitable to be used in a medical scenario. To develop a usable interactive system, an analysis of clinicians’ needs with respect to the field of medical imaging is a step that cannot be omitted. We have carried out several interviews with clinicians, particularly radiologists, to understand how to simplify the execution of 3D interaction tasks. Moreover, we have designed several prototypes of interaction techniques that have been tested by clinicians and medical students. The aim of this testing was to identify the common skills they are gifted with, and so to allow the design of metaphors able to transfer these skills into the virtual world. The results we achieved can be summarized in the following points: • physicians prefer to use input devices with a high degree of affordance [22], that is, they prefer the devices, themselves to suggest how they may be interacted with. Following this cue, it could seem that data or pinch gloves are the best input devices for this area. However, we also have to consider that physicians strongly prefer non-obstructive to obstructive hardware components. They prefer to use an input device with a disposable metaphor: grab it, use it, drop it. Wearing and calibrating a data glove is instead not a quick task. Moreover, a glove gives the user an erroneous perceived affordance to provide a tactile feedback; • the display system also has to be non-obstructive. As for the input device, the display system has to leave the user free to move. Fully immersive displays (head mounted displays, arm-mounted displays, virtual retinal displays), block out the real world. For this reason, physical objects require a graphical representation in the virtual world. In addition, the input device may be difficult to use because it cannot be seen. Semi-immersive displays (stereo monitors, workbenches, surround-screen virtual reality systems), instead, allow the user to see both the physical and virtual world [5]. Probably the best way to introduce VR into daily clinical practice is to provide physicians with a virtual window where they can view virtual objects. A full immersion (i.e. a virtual world), in this context, is unnecessary; • intuitive interaction techniques are preferred to complex ones so as to ensure that they are considered “user-friendly”. Clinicians are not inclined to use new tools if they require time-consuming training and system configuration;
• a near-real-time interactivity has to be provided in order to allow a natural interaction. As outlined by Myers et al. in [21], this is a central aspect of all three-dimensional interfaces: the effect of direct manipulation of 3D objects cannot be achieved if the system is not able to respond quickly enough. In the medical imaging area in particular, near-real-time interactivity places strong demands on performance, especially on graphical rendering;
• ergonomics is a key aspect. If clinicians have to use the virtual reality system for a long time, an input device that is too heavy or difficult to handle may be rejected. In particular, clinicians outlined a requirement for lightweight interaction devices appropriate to the physical ergonomics of the hand.
In the interaction technique design process, we also have to consider that not all the universal 3D tasks [5] (navigation, selection, manipulation, system control) are required in the interaction with volumetric medical data. As reported in [12], there are two different methods of operating in VR: one for industrial, architectural and art-related applications, the other for clinical medicine or biomedical research. In the first method, the Virtual Reality Space Method (VRSM), the user is placed in a space to be navigated; since there is a large virtual space to explore, the navigation task is essential. In the second one, the Virtual Reality Object Method (VROM), there is only a data object to examine and manipulate, and it is already within touching distance. To interact in a virtual medical imaging environment, clearly a VROM method should be applied. The need for navigation (travel and way-finding) is essential to perform virtual endoscopy, but not if the application focuses on a general-purpose visualization of a medical dataset. Even if there is only one object in the scene, the selection task is still necessary. For a generic VR user there is only a single object in the scene, but for a clinician that object includes several different anatomic regions. So the selection task is not needed to grab the object, which is always selected, but rather to "point to" a part of it. Obviously the manipulation task is essential: once the reconstructed 3D object is visualized, clinicians want to inspect it accurately by rotating it, by zooming in and out or by cropping some of its parts. The accuracy of the input device is also an important issue. Especially in performing the pointing task, the input device has to be able to move the 3D pointer with an adequate degree of accuracy. Physicians are used to moving a 2D pointer with the mouse, one of the features of which is device location persistence [33]: when we drop the mouse, the pointer remains fixed in the position we left it. If we use a direct pointing technique instead, we have to address the issue of hand tremors, which can make the pointing task really impracticable. If handheld input devices are used, the selection task also becomes difficult because of the Heisenberg effect: on a tracked device, a discrete input (e.g. a button press) often disturbs the position of the tracker (see [6]). Another important issue to be taken into account is cooperation. What we have observed is that the editing of medical data is often a collaborative task: usually the process of drawing up a medical report involves two or three physicians at the same time, and different medical experts have different approaches according to their skills. A suitable interactive and immersive system should provide mechanisms not only
to visualize the same 3D object at the same time, but also to interact with it in a cooperative and/or competitive way. If haptic devices are not used, another choice to make is how to replace the tactile feedback. Visual feedback is only one of several possible solutions to confer on users an awareness of the state of the system, of what they have just done and what more they can do.
5.4 Interaction Metaphors and Techniques Interaction metaphors are probably the key to developing interaction interfaces focused on ease of use and ease of learning, in a word, usable. The metaphor mechanism consists in copying real world concepts in order to transfer users’ knowledge into a new context related to a task execution in a virtual world. A metaphor could be defined as realistic or magic, depending on the desired interaction type. The goal of realistic metaphors is to make the interaction work in exactly the same way as in the real world. On the contrary, the goal of magic metaphors is to confer on users new physical and cognitive abilities which are unavailable in the real world. A good metaphor needs to be both representative of the task and compatible with previous user knowledge. Moreover, it has to be designed taking the physical constraints of the input device into account. Basically, interaction techniques can be considered as implementations of interaction metaphors. In recent years, several task-based interaction techniques have been presented to select, manipulate and navigate through virtual objects, but most of them are based on the same basic interaction metaphors [23]. Realistic and magic metaphors have to be understood as two extremes, since most interaction techniques are based on both. Interaction designers use metaphors to convey knowledge from a source domain to a target domain. To our knowledge, the source domain considered is always the real world. We believe that, to design natural interfaces, in some application scenarios the source domain could be extended. As Bowman pointed out, usable interaction techniques have to be designed for a specific user, domain, task, device and application [8]. Our experience shows that expert 2D interface users for specific applications find it more natural to use 3D interaction techniques that extend the wellknown 2D ones. Following this view, the source domain of interaction metaphors could be extended to include not only the real world but also the common ways of performing tasks in traditional human-computer interfaces. This approach leads to a wider definition of natural interaction metaphors, in which not only “actions performed in the real-world” but also “actions based on familiar experiences” are included.
5.4.1 Realistic Metaphors Realistic metaphors, also called natural, make it possible to interact in the virtual world exactly as in the real one. These metaphors are deduced by observing how we interact with the real world. This is probably the easiest way to achieve naturalness. As Aliakseyeu et al. pointed out [2], if a user can interact with a VE as in the real world, he does not need to pay attention to how to interact, but only on the task execution. Since the interaction interface becomes invisible, the realistic metaphor leads to a human-task interaction rather than a human-computer one. In other words, realistic metaphors are designed to minimize the cognitive overhead [11]. Users should not be distracted by the interface, they only have to use their real life skills to accomplish their tasks and goals. Aliakseyeu et al. also provide a general requirement for naturalness: atomic actions required to perform the interaction tasks of interest should match the atomic actions provided by the interaction device. For instance, if a user wants to rotate a virtual object, he has to grab the object, then rotate it and finally release it, exactly the same sequence of actions as is needed to perform this task in the real world.
5.4.1.1 A Realistic Metaphor: Virtual Hand. The Virtual Hand is probably the best known realistic metaphor. As the name suggests, it is based on a virtual representation of the user hand. Many interaction techniques are based on this metaphor, inasmuch as it allows a natural manipulation. Users are only required to move their hand in the desired orientation to rotate the object. Usually the drawback of interaction techniques based on the virtual hand metaphor is the inaccessibility of distant objects. In fact, if objects are placed in the scene far from the user, the anatomical limit of his arm makes it impossible to intersect them. Moreover, manipulation of large objects is often difficult, since the virtual hand is too small to rotate them properly [4]. It is worth noting that this metaphor suits the Aliakseyeu general requirement for naturalness, since the sequence of interactive actions required to manipulate in the VE is the same as that required in the real world.
5.4.2 Magic Metaphors

Magic metaphors provide users with additional abilities that are unavailable in the real world. Extensible arms and laser rays emanating from the users' hands are only some of the proposed – and often used – metaphors for VEs. For this kind of metaphor, the design space is nearly unlimited. Usually, techniques based on magic metaphors require a longer training time, and their naturalness is obviously lower than that of realistic metaphors.
5.4.2.1 A Magic Metaphor: Virtual Pointer

The Virtual Pointer metaphor can be used to select, manipulate and travel through virtual objects. It consists in visualizing a pointer in the scene, the position of which can be controlled in different ways. The Ray-Casting technique is the best-known implementation of this metaphor. In this technique, the direction of the virtual pointer is defined by the orientation of a virtual hand. For example, to select a 3D object we can simply point at it with the virtual hand, which is an avatar of our real hand. This technique allows us to easily reach virtual objects that are far from the observer.
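As a hedged illustration of a Ray-Casting implementation, the sketch below intersects the pointing ray with axis-aligned bounding boxes of the scene objects and returns the closest hit; the box representation is an assumption made for the example, not part of the system described in this chapter.

```python
import numpy as np

def ray_cast_select(origin, direction, boxes):
    """Return the index of the closest axis-aligned box hit by the pointing ray.

    origin, direction : (3,) ray defined by the tracked hand position/orientation
    boxes             : list of (min_corner, max_corner) pairs, one per object
    """
    direction = direction / np.linalg.norm(direction)
    best, best_t = None, np.inf
    for idx, (lo, hi) in enumerate(boxes):
        with np.errstate(divide="ignore", invalid="ignore"):
            t1 = (np.asarray(lo) - origin) / direction   # slab entry/exit distances
            t2 = (np.asarray(hi) - origin) / direction
        t_near = np.nanmax(np.minimum(t1, t2))
        t_far = np.nanmin(np.maximum(t1, t2))
        # Objects behind the user (t_far < 0) are never selected.
        if t_near <= t_far and t_far >= 0 and t_near < best_t:
            best, best_t = idx, max(t_near, 0.0)
    return best
```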
5.4.3 Pros and Cons of Realistic vs. Magic Interaction Metaphors Realistic metaphors have the potential to be the more usable ones. Users do not need particular training to interact with the VE since they can exploit their real-life experiences. But, because of the limited features of the technological intermediaries, natural 3D interaction cannot be completely reproduced [8]. Even the most powerful VEs cannot involve all human senses in the interaction. The consequence of such a limitation is that the execution of basic interaction tasks could be extremely inefficient or even impractical. In some scenarios, users need to interact without intrusive input devices, by using only their hands as end effectors. In these contexts haptic devices, able to provide users with force feedback, cannot be used. Therefore, grabbing or holding an object could seem unnatural. In other scenarios, the force feedback could be even unwanted. For instance, Bowman et al. [8] describe a virtual office design application. If designers want to try a new type of desk, surely they do not need to feel its weight. Moreover, they would probably prefer to choose the desk model from a menu instead of moving to a virtual furniture store. In accordance with these considerations, Bowman suggests implementing realistic interaction metaphors only when a replication of the physical world is absolutely necessary [5]. So if the goal of the virtual experience is to achieve a training transfer, i.e. to educate users to do the right thing at the right moment in the real world (such as in military or medical training), real metaphors are an unavoidable choice. On the contrary, when the goal of the virtual experience is to allow experiences that would be impossible in the real world, then magic metaphors are the better option [9].
5.5 The Proposed Input Device: the Wiimote

The Wiimote, alias Wii Controller or Wii Remote, is the primary controller for Nintendo's Wii console [31]. Basically, it is a wireless, ergonomic and economical input device. It weighs about 148 grams; its height is 14.8 cm, its width 3.62 cm and its thickness 3.08 cm. Thanks to the Wiimote's motion sensing capability, users are able to interact with and manipulate items on the screen simply by moving the Wiimote in space. The ease of use of this controller has made it very popular on the web: many unofficial websites provide accurate technical information obtained by a reverse engineering process [30, 29]. In the following sections we briefly introduce the most important Wiimote features.
Fig. 5.1 The Wiimote.
5.5.1 Communication

The Wiimote communicates via a Bluetooth wireless link. It follows the Bluetooth Human Interface Device (HID) standard, which is directly based upon the USB HID standard. It is able to send reports to the host with a maximum frequency of 100 reports per second. The Wiimote does not require authentication: once it is put in discoverable mode, a Bluetooth HID driver on the host can query it and establish the connection.
5.5.2 Inputs
Controller movements can be sensed over a range of +/- 3g with 10% sensitivity thanks to a 3-axis linear accelerometer, which makes it possible to detect the controller's orientation in space. Calibration data are stored in the Wiimote's flash memory and can be modified via software. Another feature is the optical sensor at the front, able to track up to four infrared (IR) hotspots. By tracking the position of these points in the 2D camera field of view (with a resolution of 1024x768), accurate pointing information can be derived. There are 12 buttons on the Wiimote: four of them are arranged in a directional pad, and the other eight are spread over the controller.
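As a rough illustration of how these two sensing channels can be turned into interaction data, the following minimal Python sketch derives a static tilt estimate from a calibrated accelerometer reading and a normalized pointing position from the tracked IR hotspots. It is not the Wiimote's API: the inputs are assumed to be already calibrated to g units and expressed in 1024x768 camera coordinates, and the function names are ours.

import math

def tilt_from_acceleration(ax, ay, az):
    """Estimate pitch and roll (in radians) from a calibrated 3-axis reading
    expressed in g units. The estimate is only meaningful when the controller
    is held still, so that gravity dominates the measured acceleration."""
    pitch = math.atan2(ay, math.sqrt(ax * ax + az * az))
    roll = math.atan2(ax, math.sqrt(ay * ay + az * az))
    return pitch, roll

def pointer_from_ir(points, width=1024, height=768):
    """Derive a normalized 2D pointing position from the IR hotspots tracked by
    the front camera: the midpoint of the first two tracked points is mapped
    from camera coordinates (1024x768) to the [0, 1] x [0, 1] range."""
    if not points:
        return None
    pts = points[:2]
    x = sum(p[0] for p in pts) / len(pts) / width
    y = sum(p[1] for p in pts) / len(pts) / height
    return x, y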
5.5.3 Outputs
The Wiimote can produce output in three different ways: switching on/off up to four blue LEDs, vibrating, and emitting sounds.
5.5.4 Classification
The Wiimote is an input device that, by virtue of its features, can be considered an absolute novelty among 3D user interfaces. The presence of both an infrared camera and a three-axis accelerometer makes it possible to use it in many ways, following different interaction metaphors. Zhai, in [33], classified input devices as isometric or isotonic. A device is defined as isometric if it senses force but does not perceptibly move (e.g. the SpaceBall), and as isotonic if it connects the human limb and the machine through movement (e.g. the flying mouse). According to this classification criterion, the Wiimote can be defined as an isotonic device, suitable for the implementation of a zero-order control (a position control mode). Zhai asserts that, generally, the advantages of this kind of control are the
ease of learning and the high speed, whereas the disadvantages are the limited movement range and the difficulty of device acquisition. The last point does not apply to the Wiimote, which, in contrast to most tracking devices, shows no problem in data acquisition. Following the directives Jacob provided in [18], the Wiimote can also be classified in terms of:
• type of motion: both linear and rotary;
• absolute or relative measurement: absolute;
• physical property sensed: position;
• number of dimensions: potentially three linear and three angular;
• direct or indirect control: indirect;
• position or rate control: position.
5.6 The Proposed Interaction Techniques
In this section we briefly introduce the 3D interaction techniques we are designing for interacting with volumetric medical data in a semi-immersive virtual environment. The emphasis is also on how these techniques have been mapped onto the Nintendo Wiimote controller. This work is part of the broader activities of the Advanced Medical Imaging and Computing labOratory (AMICO); more information can be found in [3]. After an analysis of the most frequently occurring tasks in medical data inspection, we have identified the following as necessary 3D interaction features:
• pointing - to point out a precise point in the visualized data;
• rotation - to visualize the 3D object from all possible points of view;
• translation - to move the object in 3D space;
• zooming in/out - to better focus on regions of data;
• cropping - to cut off and view inside the data.
The finite state machine of the 3D interaction model is reported in Figure 5.2. The system consists of two macro-states: the Manipulation state and the Cropping state. The user can switch between these states by pushing a button on the input device.
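A minimal sketch of this two-state model is given below. It only illustrates the dispatching logic: the HOME button acts as the state switch (as in Tables 5.1 and 5.2), while the handler bodies are placeholders for the actual manipulation and cropping operations.

class InteractionModel:
    """Two macro-states of the 3D interaction model. The HOME button toggles
    between the Manipulation and Cropping states; every other button event is
    dispatched to the handler of the current state."""

    def __init__(self):
        self.state = "MANIPULATION"

    def on_button(self, button):
        if button == "HOME":
            self.state = "CROPPING" if self.state == "MANIPULATION" else "MANIPULATION"
        elif self.state == "MANIPULATION":
            self.handle_manipulation(button)
        else:
            self.handle_cropping(button)

    def handle_manipulation(self, button):
        # e.g. A grabs/rotates the object, B/+ dolly out/in, PAD translates, 1/2 zoom
        print("manipulation:", button)

    def handle_cropping(self, button):
        # e.g. A rotates the cropping box, B/+ select a face, PAD moves the selected face
        print("cropping:", button)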
5.6.1 The Manipulation State
In this state, the input device can be used both as a laser pointer-style device and to rotate and translate the 3D object.
Fig. 5.2 Finite state machine of 3D interaction.
5.6.1.1 Pointing
A flavor of the classic ray-casting technique has been used to control the position in space of a three-dimensional cursor. In the ray-casting technique a virtual light ray is used to select an object: the user only has to move it to the desired position. If this technique is applied to drive a cursor in a virtual environment, a problem arises: how to move it inward or outward? In order to solve this problem, we have chosen to adopt the fishing-reel flavor of the ray-casting technique [4], in which users can move the cursor closer or farther away via two buttons. The same effect could be achieved by moving the input device inward or outward, but the fishing-reel technique allows a more precise movement of the pointer in space. In a medical imaging scenario, where accuracy is essential, this has been an unavoidable choice. Recently we have also presented a new technique that dynamically adapts the pointer position so that it moves only on the visible surfaces of the 3D object [14]. We are currently carrying out an evaluation to understand which technique gives the best results in terms of simplicity and accuracy. Once a point of interest has been located, the cursor can be fixed by simply pressing a button on the input device. Visual feedback (a change of the pointer color) is provided to users when the pointer intersects a volume or when the object is grabbed (so as to rotate it).
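The sketch below illustrates the fishing-reel idea under simple assumptions of ours: the pointing device supplies a ray (origin and direction), and two buttons reel the cursor along that ray by a fixed step. The class name, step size and distance bounds are illustrative.

class FishingReelPointer:
    """Ray-casting cursor whose distance along the ray is 'reeled' in or out
    with two buttons instead of physically moving the input device."""

    def __init__(self, step=0.05, min_dist=0.1, max_dist=10.0):
        self.distance = 1.0
        self.step = step
        self.min_dist = min_dist
        self.max_dist = max_dist

    def reel(self, direction):
        """direction = +1 pushes the cursor farther away, -1 pulls it closer."""
        self.distance = min(self.max_dist,
                            max(self.min_dist, self.distance + direction * self.step))

    def cursor_position(self, origin, ray):
        """3D cursor position: origin + distance * normalized ray direction."""
        norm = sum(c * c for c in ray) ** 0.5
        return tuple(o + self.distance * c / norm for o, c in zip(origin, ray))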
5.6.1.2 Translation and Zooming
In the manipulation state a user can also zoom in/out on the object and perform a dolly in/out. A dolly consists in increasing/reducing the distance between the camera and the object (and consequently increasing/reducing the horizontal parallax and the eye angle), so as to modify the perception of depth and distance. The effects of these two features are quite different: if the depth perception is increased, the user visualizes the object closer to him; if the object is zoomed, the
user visualizes a bigger object but at the same distance from him. Once the distance between the object and the camera is fixed, the user can use the pad to move the object left, right, up or down. The mapping between the Wiimote buttons and the interaction features in the manipulation state is reported in Table 5.1.

Table 5.1 Wiimote button mapping in the manipulation state

Wiimote Button   Action Performed
A                Rotate the object
B                Dolly out
+                Dolly in
PAD UP           Move up the object
PAD DOWN         Move down the object
PAD LEFT         Move left the object
PAD RIGHT        Move right the object
1                Zoom out
2                Zoom in
HOME             Switch to the cropping state
5.6.1.3 Rotation
The rotation issue deserves more than a few words, since executing a rotation in virtual space is probably one of the most complex tasks. In particular, when dealing with the inspection of 3D reconstructions of anatomical objects, the rotation task is probably the one that can make the difference between a system that is usable and one that is not. Physicians should be able to rotate 3D objects easily in order to inspect volumetric data from all possible points of view. To assist them in this task, we have developed three different interaction techniques based on realistic and magic metaphors. The first interaction technique we have implemented is based on the virtual hand metaphor. Since the Wiimote has motion sensing capability, it is able to detect its orientation in space. By using this feature, we have implemented an interaction technique in which users press a button to grab the object, then rotate the input device with 3 DoF in order to orient the 3D object accordingly, and finally fix the achieved orientation by releasing the rotate button. This technique is quite similar to the arm-extension one [4]; in the arm-extension technique the object manipulation is performed through a data glove device, whereas in the proposed technique the user rotates the handheld input device. Medical experts evaluating this rotation technique have found it ineffective, for two main reasons:
1. input device technological limits - the Wiimote is not able to sense small rotations with a degree of precision adequate for an effective inspection;
2. difficulty of use - rotation along a desired, fixed axis can be a hard task; canonical 3D rotations (pitch, yaw, and roll) are difficult to perform, since hand tremors or wrong hand movements can lead to an unwanted object orientation.
On the basis of the informal user evaluation carried out on the first interaction technique, we have implemented a second one in which the rotation task is performed with only 1 DoF. In order to provide an accurate rotation technique, in this second version we have chosen to allow rotations only about the canonical axes, i.e. only pitch, yaw and roll rotations are permitted. The rotation axis is identified automatically: users start rotating the input device, the system identifies the rotation axis and visualizes a rotation aid on the screen. Through the rotation aid, the system informs users about the identified axis. The aid also provides users with a scale, whose role is to give information about the maximum amount of rotation that can be performed (Fig. 5.3.a).
Fig. 5.3 (a) 3D rotation about the Y axis (pitch). The system visualizes a rotation aid to support users in the rotation task. To rotate along a different axis, the user has to release and press again the rotate button on the input device. (b) 3D rotation through a pointer. The 3D pointer can be moved also inwards and outwards, but only its 2D position (x, y) is used to compute the rotation.
In fact, in order to overcome the Wiimote limitation in terms of motion sensing, in this technique the maximum amount of rotation has been scaled down. This means that there is no longer a 1:1 geometrical correspondence between the Wiimote and the virtual object rotations. Evaluation of this technique has shown good usability: clinicians have been able to easily perform the desired rotations. Nonetheless, they have pointed out some limitations:
1. long rotation time - this technique makes it possible to perform precise rotations, but wide rotations have to be performed as a sequence of smaller ones;
2. unmodifiable centre of rotation - 3D objects can be rotated only around their centre, whereas clinicians want to be able to change it at run-time.
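The following minimal sketch captures the core of this second technique: an incremental device rotation is snapped to its dominant canonical axis and scaled down before being applied to the object. The scale factor, the clamping value and the function name are illustrative choices, not the values used in the actual system.

def constrained_rotation(delta_pitch, delta_yaw, delta_roll, scale=0.5, max_angle=0.35):
    """Snap an incremental device rotation (radians) to its dominant canonical
    axis and scale it down, so that tremors on the other axes are ignored and
    there is no 1:1 correspondence between device and object rotation.
    Returns (axis_name, scaled_angle)."""
    deltas = {"pitch": delta_pitch, "yaw": delta_yaw, "roll": delta_roll}
    axis = max(deltas, key=lambda k: abs(deltas[k]))               # dominant axis only
    angle = max(-max_angle, min(max_angle, scale * deltas[axis]))  # scaled and clamped
    return axis, angle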
The last interaction technique we have implemented is based on a purely magic metaphor. During the task analysis, we observed that clinicians are able to easily rotate 3D objects on 2D displays by using the mouse. Most medical imaging toolkits that offer 3D reconstruction (without stereoscopic visualization) allow the rotation of objects with a mouse-based technique: users simply click the left mouse button and move the 2D pointer, and the object rotates towards the pointer with a speed that varies according to the breadth of the pointer movement. Since clinicians find this way of rotating natural, we have tried to adapt it to VEs by replacing the 2D pointer with a 3D one, which can be moved in the scene in a laser-pointer style thanks to the optical sensor of the Wiimote. Rotations can be performed exactly as in the mouse-based 2D interface, and the centre of rotation can be changed easily since it can be fixed by using the 3D pointer (Fig. 5.3.b). It is worth noting that, thanks to the Wiimote's support for both linear and rotary motion, it is possible to avoid the rotational mismatch that occurs during a mouse-based roll (i.e. a rotation about an axis aligned with the direction in which the camera is oriented). The mouse-based roll rotation is usually performed by pressing a key on the keyboard and a mouse button at the same time and by moving the mouse up/down (or left/right). The hand movement is completely different from that of the 3D object, and for this reason we can argue that this kind of rotation is counterintuitive. In all the proposed interaction techniques, we have exploited the Wiimote accelerometer to sense roll rotations, so as to avoid this kind of mismatch.
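A minimal sketch of this pointer-driven rotation is given below, under assumptions of ours: the pointer displacement is expressed in normalized screen units, the object rotates about the screen-space axis perpendicular to the movement with an angle proportional to the movement breadth, and roll is estimated directly from the accelerometer. Gains and names are illustrative.

import math

def pointer_rotation(dx, dy, gain=0.5):
    """Map a 2D pointer displacement (dx, dy), in normalized screen units, to an
    incremental rotation of the object 'towards the pointer': the rotation axis
    lies in the screen plane, perpendicular to the movement, and the angle is
    proportional to the movement breadth. Returns (axis, angle)."""
    breadth = math.hypot(dx, dy)
    if breadth == 0.0:
        return (0.0, 0.0, 1.0), 0.0
    axis = (-dy / breadth, dx / breadth, 0.0)   # screen-plane axis, camera coordinates
    return axis, gain * breadth

def roll_from_accelerometer(ax, ay):
    """Roll angle about the view axis, estimated from the gravity components
    sensed on the controller's x and y axes while it is held still."""
    return math.atan2(ax, ay)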
5.6.2 The Cropping State
In this state a volume of interest can be extracted from the 3D object. The cut is performed by using a cropping box, i.e. an arbitrarily oriented hexahedron with orthogonal faces. This technique, the most commonly used for cutting 3D models, has been adapted in order to be suitable for virtual environment interaction. In a 3D scene visualized on a 2D monitor, users manipulate the cropping box (i.e. rotate it and move its faces) via mouse and keyboard. On the cropping box there are six handles (one for each face) that, once moused on, allow the user to move the faces; pointing and clicking directly on a face, instead, allows the user to rotate the whole cropping box. The user is usually provided with visual feedback in order to understand whether he is going to translate a face or rotate the box: as can be seen in Figure 5.4, we have chosen to turn the handle red every time it is moused on. But if the user is immersed in a 3D scene, i.e. the interaction takes place in a virtual environment, this approach has to be changed. First of all, a 2D pointer cannot be used, because it would disrupt the stereoscopic effect. Moreover, if a 3D pointer is used, it is difficult for the user to select and move a handle in 3D space. Furthermore, in a bi-dimensional visualization, the user can mouse on the handle
Fig. 5.4 Cropping: (a) face translation (b) box rotation.
without being concerned about its position in 3D space - only its projection on the 2D screen is relevant - whereas in an immersive environment this could create confusion. For these reasons, we do not use a pointer at all for cropping the 3D object. The cropping box can be rotated just like the 3D object, and its faces can be selected and translated simply by pushing the PAD buttons of the input device. In particular, the rotation technique is exactly the same as that used in the manipulation state: if users are able to rotate a 3D object via the Wiimote, they are able to rotate the cropping box as well.

Table 5.2 Wiimote button mapping in the cropping state

Wiimote Button   Action Performed
A                Rotate the cropping box
B                Select the previous face of the box
+                Select the next face of the box
PAD UP           Move up the selected face
PAD DOWN         Move down the selected face
PAD LEFT         Move left the selected face
PAD RIGHT        Move right the selected face
1                (unused)
2                (unused)
HOME             Switch to the manipulation state
The reason why cropping has been modeled as a state and not as a function is that it is a complex operation that includes several interaction steps. Switching to a
new state makes it possible to map the input device buttons and sensors to the execution of the single steps required. How the cropping features have been mapped onto the Wiimote device is reported in Table 5.2.
5.7 Discussion
In this study, we have introduced the topic of interaction techniques and devices for medical applications. In particular, we have tried to outline the constraints to comply with in order to develop an interactive system usable in daily clinical practice. We claim that the weak link in this ongoing process is the interaction: the usability of 3D user interfaces is still not at a desirable level. In discussing the advantages and disadvantages of existing realistic vs. magic approaches to interface design, the main idea that has emerged is that the preferred way to provide sufficient usability in real-world applications is to develop techniques and metaphors for specific scenarios. According to this view, an understanding of the requirements of the medical stakeholders is crucial in the design. We have also reported our experience in implementing 3D interaction techniques for medical VEs. We have observed that physicians who had previously manipulated 3D objects in 2D medical imaging applications seemed reasonably confident with the almost realistic interaction techniques we have developed. The rotation technique, in particular, was found easy to learn by medical experts, probably because it is derived from their previous experience with 2D computer interfaces based on the WIMP metaphor. Therefore, even if new input devices are used, we cannot ignore the fact that nowadays the mouse and keyboard can be considered in every respect part of the real world, and therefore part of common education. As Jacob claimed in [19], naturalness and, in general, the principles of reality-based interaction can be relaxed if the trade-off is a gain in expressive power, efficiency or practicality. However, the process of improving the naturalness of human-computer interaction in this field is still in progress. Only if clinicians are able to exploit their real-life abilities in virtual environments will they fully profit from the use of VR technologies in their everyday work.
References
1. Abaci, T., de Bondeli, R., Cíger, J., Clavien, M., Erol, F., Gutiérrez, M., Noverraz, S., Renault, O., Vexo, F., Thalmann, D.: Magic Wand and the Enigma of the Sphinx. Computers & Graphics, 28(4), 477–484 (2004)
2. Aliakseyeu, D., Subramanian, S., Martens, J.-B., Rauterberg, M.: Interaction techniques for navigation through and manipulation of 2D and 3D data. In: 8th Eurographics Workshop on Virtual Environments, pp. 179–188. Eurographics Association, Aire-la-Ville (2002)
3. Advanced Medical Imaging and Computing labOratory (AMICO), available at http://amico.icar.cnr.it/
4. Bowman, D.A., Hodges, L.F.: An Evaluation of Techniques for Grabbing and Manipulating Remote Objects in Immersive Virtual Environments. In: 1997 Symposium on Interactive 3D Graphics, pp. 35–38. ACM Press, Providence (1997)
5. Bowman, D.A., Kruijff, E., LaViola, J.J.Jr., Poupyrev, I.: An Introduction to 3D User Interface Design. Presence, 10(1), 96–108 (2001)
6. Bowman, D.A., Wingrave, C.A., Campbell, J.M., Ly, V.W., Rhoton, C.J.: Novel Uses of Pinch Gloves for Virtual Environment Interaction Techniques. Virtual Reality, 6(3), 122–129 (2002)
7. Bornik, A., Beichel, R., Kruijff, E., Reitinger, B., Schmalstieg, D.: A Hybrid User Interface for Manipulation of Volumetric Medical Data. In: Symposium on 3D User Interfaces, pp. 29–36. IEEE Computer Society Press, Los Alamitos (2006)
8. Bowman, D.A., Chen, J., Chadwick, C.A., Lucas, J.F., Ray, A., Polys, N.F., Li, Q., Haciahmetoglu, Y., Kim, J., Kim, S., Boehringer, R., Ni, T.: New Directions in 3D User Interfaces. International Journal of Virtual Reality, 5(2), 3–14 (2006)
9. Bowman, D.A., McMahan, R.P.: Virtual Reality: How Much Immersion Is Enough? Computer, 40(7), 36–43 (2007)
10. Brooks, F.P.Jr.: What's Real About Virtual Reality? IEEE Computer Graphics and Applications, 19(6), 16–27 (1999)
11. Conklin, J.: Hypertext: An Introduction and Survey. Computer, 20(9), 17–41 (1987)
12. Dech, F., Silverstein, J.C.: Rigorous Exploration of Medical Data in Collaborative Virtual Reality Applications. In: Sixth International Conference on Information Visualisation, pp. 32–38. IEEE Computer Society, Los Alamitos (2002)
13. Gallo, L., De Pietro, G., Marra, I.: 3D interaction with volumetric medical data: experiencing the Wiimote. In: 1st International Conference on Ambient Media and Systems, pp. 1–6. ICST, Brussels (2008)
14. Gallo, L., Minutolo, A.: A Natural Pointing Technique for Semi-Immersive Virtual Environments. To appear in: 5th Annual International Conference on Mobile and Ubiquitous Systems. ACM Press, Providence (2008)
15. Hansen, C., Köhn, A., Schlichting, S., Weiler, F., Zidowitz, S., Kleemann, M., Peitgen, H.O.: Intraoperative modification of resection plans for liver surgery. International Journal of Computer Assisted Radiology and Surgery (2008)
16. Haubner, M., Krapichler, C., Lösch, A.: Virtual reality in medicine - computer graphics and interaction techniques. IEEE Transactions on Information Technology in Biomedicine, 1(1), 61–72 (1997)
17. Imielinska, C., Molholt, P.: Incorporating 3D virtual anatomy into the medical curriculum. Communications of the ACM, 48(2), 49–54 (2005)
18. Jacob, R.J.K.: Human-computer interaction: input devices. ACM Computing Surveys, 28(1), 177–179 (1996)
19. Jacob, R.J.K., Girouard, A., Hirshfield, L.M., Horn, M.S., Shaer, O., Solovey, E.T., Zigelbaum, J.: Reality-based interaction: a framework for post-WIMP interfaces. In: Twenty-Sixth Annual SIGCHI Conference on Human Factors in Computing Systems, pp. 201–210. ACM Press, New York (2008)
20. LaViola, J.J.Jr.: MSVT: A Virtual Reality-Based Multimodal Scientific Visualization Tool. In: IASTED International Conference on Computer Graphics and Imaging, pp. 221–225 (1999)
21. Myers, B., Hudson, E.H., Pausch, R.: Past, present, and future of user interface software tools. ACM Transactions on Computer-Human Interaction, 7(1), 3–28 (2000)
22. Norman, D.A.: The Design of Everyday Things. Doubleday, New York (1988)
23. Poupyrev, I., Weghorst, S., Billinghurst, M., Ichikawa, T.: Egocentric Object Manipulation in Virtual Environments: Evaluation of Interaction Techniques. Computer Graphics Forum, 17(3), 41–52 (1998)
24. Panchaphongsaphak, B., Burgkart, R., Riener, R.: Three-Dimensional Touch Interface for Medical Education. IEEE Transactions on Information Technology in Biomedicine, 11(3), 251–263 (2007)
25. Patel, H., Stefani, O., Sharples, S., Hoffmann, H., Karaseitanidis, I., Amditis, A.: Human centred design of 3-D interaction devices to control virtual environments. International Journal of Human-Computer Studies, 64(3), 207–220 (2006)
26. Robb, R.A.: The Biomedical Imaging Resource at Mayo Clinic. IEEE Transactions on Medical Imaging, 20(9), 854–867 (2001)
27. Rosset, A., Spadola, L., Pysher, L., Ratib, O.: Navigating the Fifth Dimension: Innovative Interface for Multidimensional Multimodality Image Navigation. Informatics in Radiology, 26(1), 299–308 (2006)
28. Wu, Y., Yencharis, L.: Commercial 3-D Imaging Software Migrates to PC Medical Diagnostics. Advanced Imaging Magazine, 16–21 (1998)
29. WiiBrew Wiki, http://wiibrew.org/index.php?title=Wiimote
30. WiiLi, http://www.wiili.org
31. Nintendo Wii, http://www.nintendo.com/wii/what/controllers
32. Yang, L., Chen, J.X., Liu, Y.: Virtual Human Anatomy. Computing in Science and Engineering, 7(5), 71–73 (2005)
33. Zhai, S.: User performance in relation to 3D input device design. In: ACM SIGGRAPH Computer Graphics, pp. 50–54. ACM Press, New York (1998)
34. Zajtchuk, R., Satava, R.M.: Medical Applications of Virtual Reality. Communications of the ACM, 40(9), 63–64 (1997)
Chapter 6
Bridging Sensing and Decision Making in Ambient Intelligence Environments Elie Raad, Bechara Al Bouna and Richard Chbeir
Summary. Context-aware and Ambient Intelligence environments have been one of the emerging research issues of the last decade. In such intelligent environments, information is gathered to provide, on the one hand, autonomic and easy-to-manage applications and, on the other, secure access-controlled environments. Several approaches have been defined in the literature to describe context-aware applications, with techniques to capture and represent information related to a specified domain. However, to the best of our knowledge, none has questioned the reliability of the techniques used to extract the meaningful knowledge needed for decision making, especially when the captured information is of multimedia type (images, sound, videos, etc.). In this chapter, we propose an approach to bridge the gap between sensing and decision making, and provide an uncertainty resolver to reduce faulty decisions based on uncertain knowledge extracted by unreliable techniques. We also describe a set of experiments elaborated to demonstrate the efficiency of our uncertainty resolver.
Key words: context-aware application, semantic-based, uncertainty resolver model
Elie Raad, LE2I Laboratory UMR-CNRS, University of Bourgogne, 21078 Dijon Cedex, France, e-mail: [email protected]
Bechara Al Bouna, LE2I Laboratory UMR-CNRS, University of Bourgogne, 21078 Dijon Cedex, France, e-mail: [email protected]
Richard Chbeir, LE2I Laboratory UMR-CNRS, University of Bourgogne, 21078 Dijon Cedex, France, e-mail: [email protected]
6.1 Introduction
Nowadays, ambient intelligence is receiving a lot of attention in several application domains, due to its capacity to provide controlled environments when equipped with multimedia sensors such as cameras, microphones, and others. In fact, with the use of smart devices and embedded sensors, systems are able to maintain relevant information about users. Multimedia data describing users reveal interesting information about their location and context (user surroundings, movements, gestures, etc.). The work done by the research community has led to the definition of several context-aware or pervasive approaches aiming, on the one hand, at improving access control models [1] [2] [3] [4] [5] [6] and, on the other, at assisting users in performing their daily tasks [7] [8]. Context-aware approaches are designed with the purpose of defining applications capable of managing themselves without the direct intervention of users. They are characterized by their ability to guide users in performing appropriate tasks while providing reasoning and decisions about the actions to be triggered. Context awareness in ambient intelligence environments is considered one of the key issues in the evolution toward a fully automatic computing paradigm. It allows a system to integrate the human ability to recognize and exploit implicit information related to the users' surroundings. A context-aware system is viewed as a two-layered framework separating the decision making from the sensed multimedia information. One of the challenging issues in context-aware computing is how to bridge the gap between sensing and decision making.
Let us consider the following motivating scenario of a company equipped with multimedia devices (surveillance cameras, microphones, sensors, etc.) in each room and hallway to provide an interactive environment for its employees. The company has installed a central unit holding the decision-making tool, which uses a set of explicit information managed by the administrator. Each of the multimedia devices sends captured information to the central unit, which in turn analyzes it and invokes the appropriate tasks. In the central unit, a set of multimedia functions is used to detect and recognize people and objects in a given context. Integrating multimedia data and functions (face recognition, image similarity, object recognition, etc.) in ambient intelligence environments may, however, lead to frustrating situations and thus to uncertain decisions. This comes from several factors (lighting, electrical noise, functions' relevance, etc.) that affect the related results. In order to validate a given fact, one should either accept a reasonable error risk or consider relying on several sources (e.g. taking several snapshots of an environment) and using various multimedia functions so as to retrieve the most appropriate result for a given situation. Consequently, we can easily pin down two challenging issues from this scenario:
1. How to allow an easy use of multimedia functions and the definition of new semantic-based functions derived from a set of existing ones? This is of great importance to help the administrator control his environment interaction and to evolve it accordingly when the company's needs change.
2. How to reduce the risk of faulty decisions given a set of fuzzy inputs? This is indispensable when involving multimedia functions that provide fuzzy results when comparing objects or extracting features.
In this chapter, we address these two issues. We introduce the concept of templates to facilitate the use of multimedia functions and to derive new ones by combining existing ones. We also present an uncertainty resolver model to reduce the potential risk of using multimedia functions in ambient intelligence environments. Our resolver is based on adaptations of known decision and probabilistic analysis techniques, such as decision trees [9], Bayesian networks [10], Dempster-Shafer theory [11], etc., that aggregate the multimedia functions' results into one relevant computed result. Through a set of experimental tests, we show how our proposed approach can be beneficial and how it can be tuned.
The rest of the chapter is organized as follows. In Section 2, we provide a quick overview of context-aware models. In Section 3, we state the definitions needed to fully understand the proposed approach. Section 4 is devoted to describing the concept of template, used in our approach to bridge the gap between sensing and decision making. A detailed description of our uncertainty resolver is given in Section 5, in which we provide a set of adapted aggregation functions used to reduce the uncertainty raised when processing multimedia objects. In Section 6, we demonstrate the benefit of integrating aggregation functions and their ability to reduce uncertainty and faulty decisions. Finally, we conclude the chapter and draw several perspectives.
6.2 Related Works
Context-aware systems have recently attracted a lot of attention and have been widely explored in the literature. Many projects have treated context awareness as a key feature for pervasive computing, and it has mainly been presented as an important aspect to enforce access control models. In this section, we focus on presenting how current approaches (formally) describe context information. We also give a snapshot of major approaches integrating context as a key element to enforce related access control models.
Context modeling and representation were initially based on context widgets [12], used to separate the high-level context modeling from the capturing level. This later evolved with the integration of ontologies as a way to represent context information. Ontologies proved to be useful in representing the semantics behind a given domain, in which context is modeled as concepts and facts. They succeeded in providing a wide high-level conceptualization layer and reinforced it with automatic inference and logic reasoning using explicit and implicit rules. For instance, in [13] the authors define a hybrid conceptual model combining relational modeling and ontology-based modeling in order to represent contextual information, whereas in [14] the authors provide an interesting approach able to treat high-level implicit contexts derived from low-level explicit context information. The approach is based on a generic context which can be extended to target
some domain-specific concepts. In [15] and [16], the authors integrate contextual information in the provided access control model to enforce the access decision and dynamically handle environmental changes in the user context. In [16], the authors provide an interesting model in which they incorporate the notion of trust; in this model, contextual information related to the users' environment can affect the trust level assigned to the subject. Nevertheless, these models do not migrate well to handling multimedia contextual information: they lack the ability to define complex policies in which multimedia contextual information is evaluated. In [2] the authors define a policy-based context-aware service for network environments; the policy model is based on predicates such as time, location, activity, etc. Further, in [6], a context-aware authorization model is defined to enhance security management for Intranet protocols; context rules are defined as constraints and integrated in a role-based environment. In [4], context is semantically defined using ontologies and integrated into an access control model for pervasive computing environments. However, the previously described approaches are domain-oriented and application-dependent, and they lack the ability to define complex policies integrating multimedia contextual conditions. In [17] and [18], the authors propose a Generalized Role Based Access Control model which extends the known RBAC model by incorporating the notion of environment roles and object roles in addition to the known subject roles. The model, motivated by the Aware Home1 security challenges, takes into consideration the information gathered from a variety of sensors and defines roles accordingly. Despite the efficiency provided by this model, it is commonly admitted that in large application domains the variety of roles is a burden for authorization managers to handle. An interesting approach detailed in [19] describes a location-based access control able to authorize users on the basis of their location. Methods such as cell identification, angle of arrival, signal levels, etc. are used to calculate the location using GSM/3G devices and protocols. Nevertheless, such an approach requires that each user (on whom access should be controlled) carries a GSM/3G device, which is not always feasible (for instance, in highly secured departments, users are forced to leave their devices at a security checkpoint). Furthermore, information acquired from cells is typically limited and prone to noise and interference due to the wide areas the cells might cover. In [3], the authors enforce the role-based access control model with context filters targeting context information. The proposed approach enforces, on the one hand, the access to a subset of objects related to the user's context and, on the other hand, users' membership to a given role (or roles). However, the context filters represented in their approach are based on simple Boolean expressions with logical and Boolean operators, and therefore they cannot be extended to deal with multimedia object processing.
The approaches discussed here are based on context-aware applications where information representing a given surrounding is gathered to make platforms more autonomic, on the one hand, and secure, on the other. However, few have tackled the problem of the reliability behind multimedia object processing and the decisions to
1 Aware Home is a house equipped with highly advanced technology. It contains a rich communication infrastructure in which sensors can capture and store a large amount of information.
make in order to execute a given operation. In fact, due to the complex structure of multimedia data, extracting meaningful information from them is complex, fuzzy, uncertain, and time-consuming. In the following, we describe our approach to bridging the gap between sensing and decision making and discuss the uncertainty resolver we propose to reduce uncertainty and faulty decisions.
6.3 Preliminaries
In the following, we present some definitions needed to fully understand our proposal.
Definition 1 - Multimedia Object (Mo): allows representing several types of multimedia data such as text, image, video, etc. It is formally represented in our approach as a 4-tuple of the following form:

Mo : <id, O, A, F>    (6.1)
where:
• id: represents the identifier of the multimedia object.
• O: contains the raw data of the object, stored as a BLOB file or URI.
• A: is the metadata describing the multimedia object. It describes information related to the multimedia object and can be written as (a1:v1, a2:v2), where ai and vi represent an attribute and its corresponding value (e.g. age: 18, profession: student, etc.).
• F: describes the low-level features of a multimedia object (such as color histogram, color distribution, texture histogram, shapes, duration, audio frequency, amplitude, band number, etc.).
For instance, the following multimedia object:

Mo1 : (id = 1, O = 'photo_bob.jpg', A = 'object.name: Bob', F = 'DominantColor = (14, 24, 20)')    (6.2)
describes the picture of "Bob" (the head of the research department in our motivating scenario) together with its dominant color (using the RGB color space).
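As an illustration, a minimal Python sketch of this representation is given below; the field names follow Definition 1, while the class name and the dictionary encoding of A and F are choices of ours.

from dataclasses import dataclass, field
from typing import Any, Dict

@dataclass
class MultimediaObject:
    """The 4-tuple <id, O, A, F> of Definition 1."""
    id: int
    O: str                                              # raw data reference (BLOB path or URI)
    A: Dict[str, Any] = field(default_factory=dict)     # metadata attributes
    F: Dict[str, Any] = field(default_factory=dict)     # low-level features

# The Mo1 object of Eq. (6.2):
Mo1 = MultimediaObject(
    id=1,
    O="photo_bob.jpg",
    A={"object.name": "Bob"},
    F={"DominantColor": (14, 24, 20)},
)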
Definition 2 - Multimedia Function (f): is used to handle the comparison2 and feature extraction of multimedia objects. Numerous types of multimedia functions are provided in several commercial tools and in the literature in various forms. For instance, several are provided as SQL operators in DBMSs such as Oracle and DB2 [20] [21] [22], while others are accessible via API functions [23] [24] and web services [25] for multimedia data processing.
2 When handling rules whose conditions include multimedia objects, traditional logical operators such as 'equality', 'greater than' or others are not applicable due to the complex structure of multimedia objects and must be extended with similarity functions.
Details on such functions and their applications are out of the scope of this chapter. In our approach, we formally write a multimedia function f as:

f(Mo_j, Mo_i) → α_f^B

where:
• Mo_j represents a predefined multimedia object;
• Mo_i represents a captured multimedia object. It can be provided by multimedia devices of different types such as surveillance cameras, webcams, sound recorders, etc.;
• α_f^B is a returned threshold representing the confidence score of a Boolean value B with respect to the multimedia function. It varies in the [0, 1] interval.
For instance, consider the fact that we wish to detect the presence of Bob, head of the research department, in a snapshot image. It is possible to use 2 different multimedia functions f1 and f2 to analyze multimedia objects, where:
• f1: is related to the Oracle interMedia module [22] and is used for image similarity;
• f2: is based on color object recognition and an SVM classifier. It computes decisions based on a set of classes representing the trained images (see [26] for more details).
Thus, we can define the function contents as:

f1(Mo1(..., O = 'predBob.jpg', ...), Mo1'(..., O = 'snapshot1.jpg', ...)) → 0.5T
f2(Mo1(..., O = 'predBob.jpg', ...), Mo1'(..., O = 'snapshot1.jpg', ...)) → 0.8T    (6.3)
where the predefined MO is the predBob.jpg image3 to be compared with an input object called snapshot1.jpg using the two multimedia functions. The first function returned a confidence value of 0.5, whereas the second returned 0.8.
Definition 3 - Multimedia Attribute Expression (MA): is a Boolean expression holding two sets of multimedia objects as input for processing. It is formally described as:

MA(MO_i, MO_j) = µ({f1(MO_k, MO_l), ..., fn(MO_k, MO_l)}, {ε1, ..., εn}) θ υ    (6.4)
where:
• MO_i = {Mo_1, ..., Mo_i} and MO_j = {Mo_1, ..., Mo_j} represent, respectively, the predefined set of multimedia objects and the captured set of multimedia objects to be compared with.
• µ is an aggregation function (to be detailed in Section 5) that holds a set of multimedia functions and uncertainty thresholds ε1, ..., εn.
3 Several functions require the signature of an image (e.g. .dat files) during comparison. Here, we assume that signatures are generated on demand.
• f represents a multimedia comparison function needed to compare MO_k and MO_l (MO_k ⊆ MO_i and MO_l ⊆ MO_j).
• θ is a comparison operator among the traditional operators (e.g. =, ≠, <, >, ≤, ≥, etc.).
• υ is the validation score.
MA(MO_i, MO_j) is satisfied if the result returned by the aggregation function, compared to υ, is valid. An example of the use of MA will be provided later on.
6.4 Templates
The concept of template is used in our approach to put together the common features needed to handle multimedia processing methods and functions. Templates provide, on the one hand, ease of administration when using predefined templates for special use cases and, on the other hand, flexible manipulation of a set of multimedia functions, so as to offer precision, time saving, and minimum uncertainty risks. They allow handling the analysis of a set of captured multimedia objects using multimedia functions and aggregation functions (to be described later). A template is formally described in our approach as follows:

T : <Id, Desc, MA(MO_i, MO_j)>

where:
• Id represents the identifier of the template;
• Desc is the textual description of the template;
• MA(MO_i, MO_j) is a Boolean expression holding the multimedia attributes needed to process the sets of multimedia objects MO_i and MO_j.
For instance, consider the fact that we wish to detect (using the two different multimedia functions f1 and f2 defined earlier) the presence of Bob, head of the research department, in the conference room, in order to alert all the employees in the office and make the environment adequate for a conference. In order to provide simple manipulation and reduce uncertain decisions, we define a "Face Identification" template that combines both multimedia functions and aggregates their results using, for instance, the average aggregation function Avg (detailed in the next section). The template Face Identification can therefore be defined as follows:

T1 : <001, "Face Identification", MA(Mo1, Mo1', Mo2')>

where:
• MA(Mo1, Mo1', Mo2') = Avg({f1(Mo1, Mo1'), f2(Mo1, Mo2')}, {0.5, 0.8}) > 0.7;
• Mo1 = predBob.jpg;
• Mo1' and Mo2' represent the captured multimedia objects.
The template T1 is satisfied only if its MA is satisfied (which means that the score returned from the average aggregation function is greater than 0.7).
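As an illustration only, the sketch below shows one way a template's MA expression could be evaluated: each multimedia function is applied to its (predefined, captured) pair, the scores are aggregated, and the result is compared to the validation score. The function names, the lambda stand-ins for f1 and f2, and the average used here are all hypothetical; the real functions are the Oracle interMedia and SVM-based ones described above.

import operator

def evaluate_template(functions, captured, predefined, aggregate, theta, upsilon, epsilons=None):
    """Evaluate a template's MA expression: apply each multimedia function to the
    (predefined, captured) pair, aggregate the confidence scores together with
    the uncertainty thresholds, and compare the result to the validation score."""
    scores = [f(predefined, mo) for f, mo in zip(functions, captured)]
    eps = epsilons if epsilons is not None else [0.0] * len(scores)
    aggregated = aggregate(scores, eps)
    return theta(aggregated, upsilon), aggregated

# Stand-ins for f1 (image similarity) and f2 (SVM-based recognition) that simply
# return fixed confidence scores, and an average-style aggregation:
f1 = lambda ref, snap: 0.5
f2 = lambda ref, snap: 0.8
avg = lambda scores, eps: sum(scores) / (len(scores) + sum(eps))

satisfied, score = evaluate_template([f1, f2], ["snapshot1.jpg", "snapshot2.jpg"],
                                     "predBob.jpg", avg, operator.gt, 0.7)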
6.5 Uncertainty Resolver via Aggregation Functions
Integrating multimedia data and functions (face recognition, image similarity, object recognition, etc.) in ambient intelligence environments may lead to complex situations and thus to uncertain decisions. As mentioned before, this comes from several factors (lighting, electrical noise, functions' relevance, etc.) that affect the related results. In order to validate a given fact and to avoid error risks, one should consider relying on several sources (e.g. taking several snapshots of an environment) and using various multimedia functions to retrieve the most appropriate result for a given situation. This is why we introduce here the concept of aggregation functions, aiming to reduce uncertainty by filtering and/or aggregating a set of values in order to select or compute one relevant value that facilitates decision making. An aggregation function µ can be illustrated as in Figure 6.1 and defined by any probabilistic function such as the combination rule of the Dempster-Shafer theory of evidence (DS) [27] [11], Bayesian decision theory [10], decision trees [9], the average, the minimum, the maximum, and so on. It is formally written as:
Fig. 6.1 An aggregation function representation.
µ({f1(MO_k, MO_l), ..., fn(MO_k, MO_l)}, {ε1, ..., εn}) → α_µ^B    (6.5)
where:
• f is a multimedia function;
• ε is an uncertainty threshold (∈ [0, 1]) representing the percentage of noise that can affect the result. ε can affect the overall thresholds returned by the multimedia functions, or it can be related to each threshold individually. In that case, the noise (ε) can be automatically calculated based on the difference between the environment state of the predefined MO and the state when the instances are captured (i.e. lighting changes and background detection). If omitted, ε = 0, meaning that no uncertainty is detected.
• α_µ^B is the filtered confidence score (∈ [0, 1]) of a Boolean value B with respect to the aggregation function.
The aggregation process defined here handles multimedia functions with probabilistic values that depend on the classification and relevance of the response they uncover. Functions with a different kind of output cannot be invoked in the aggregation process unless they are normalized to the corresponding format. In the following, we present and detail how we adapted several existing aggregation functions: average-based, Bayesian network-based, Dempster-Shafer-based, and decision tree-based functions.
6.5.1 Average-based Function
One of the most commonly used aggregation functions is the average. However, it cannot be used as is in our approach, since we need to integrate the uncertainty threshold related to each multimedia function (or the overall uncertainty threshold). The proposed adapted average function, which filters the set of results returned by the multimedia functions while including the uncertainty thresholds, is defined as follows:

Avg({f1(MO_k, MO_l), ..., fn(MO_k, MO_l)}, {ε1, ..., εn}) = (Σ_{i=1}^{n} α_i^B) / (n + Σ_{i=1}^{n} ε_i) → α_Avg^B

where:
• α_i^B represents the thresholds returned by the multimedia functions;
• ε_i is the uncertainty threshold for a given multimedia function result.
It is important to note that Σ_{i=1}^{n} ε_i can be replaced by ε if an overall uncertainty threshold is defined in the aggregation function. Let us take the multimedia functions defined earlier, f1 and f2, which return a related result for each captured instance image (e.g. f1 → 0.5T and f2 → 0.8T). These thresholds are considered as two different results for the same fact (detecting the presence of Bob). Let ε = 0.2 be the predefined overall uncertainty threshold. After applying the Avg function, the computed result becomes:

Avg({0.5T, 0.8T}, 0.2) → (0.5 + 0.8) / (2 + 0.2) = 0.59T

The decision is then made by comparing the computed result with the predefined threshold.
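A direct sketch of this adapted average is shown below; the function name is ours, and a single number passed as the threshold argument is interpreted as the overall uncertainty threshold ε.

def avg_aggregate(scores, epsilons):
    """Adapted average of Section 6.5.1: the sum of the confidence scores returned
    by the multimedia functions divided by (n + sum of the uncertainty thresholds)."""
    if isinstance(epsilons, (int, float)):   # a single overall uncertainty threshold
        epsilons = [epsilons]
    return sum(scores) / (len(scores) + sum(epsilons))

# Running example: f1 -> 0.5, f2 -> 0.8, overall uncertainty threshold 0.2
print(round(avg_aggregate([0.5, 0.8], 0.2), 2))   # 0.59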
6.5.2 Bayesian Network-Based Function
A Bayesian network (BN) is a probabilistic graphical model used to represent a set of variables and their probabilistic independencies. The graph G is defined as G = (V, E), where V is a set of vertices representing the variables of interest and E the dependency relationships between these variables. Each random variable Vi can hold a finite set of mutually exclusive states and has a conditional probability table p(Vi | π(Vi))
where π(Vi) represents the parent set of Vi. The Bayesian network encodes a joint probability over the set of variables, defined by the following chain rule:

P(V) = Π_{i=1}^{n} P(Vi | π(Vi))
Hence, the BN is defined by the structure of the graph and by the conditional probability table of each variable. BNs have been applied to different domains, including several intrusion detection techniques [28] [29] and knowledge-based authentication [30]. Due to its reliability, we chose to adapt in our approach a naive BN in order to aggregate the results returned by the multimedia functions. The idea is based on the assumption that the validation of a given concept (such as detecting the presence of Bob in a given input image) takes place when several multimedia functions (or several input snapshots) returning a set of thresholds are filtered and their aggregated threshold is computed against a predefined threshold. Formally, we define the following notations and terms:
• C denotes a class variable related to the concept claimed to be valid or not. The possible outcomes of C are defined as True or False depending on its state.
• We define f = {f1, ..., fi} representing the selected subset of variables related to the concept C. Each of the variables has the parent C and represents a factoid with a possible binary outcome {True, False} indicating whether the concept C is valid or not. The set of multimedia functions f (or one function with several input images to process) are considered variables related to C, where the similarity values they compute affect the result of C.
Given the joint distribution, the probability of the class C for a True threshold value can be calculated using Bayes' rule:

P(C = T | e) = P(e | C = T) P(C = T) / P(e) = Π_{e ⊆ f} P(e | C = T) P(C = T) / Σ_{b=T}^{F} Π_{e ⊆ f} P(e | C = b) P(C = b)
where e ∈ f denotes the set of variables of interest representing the different multimedia functions. The returned filtered result is equal to the value of P(C = T | e). As a result, the BN aggregation function is described as follows:

BN({f1(MO_k, MO_l), ..., fn(MO_k, MO_l)}, {ε1, ..., εn}) → P(C = T | e)
To estimate the values of the conditional probability distributions for each node of the graph, we refer to the results returned by each of the multimedia functions designating the variables of interest. To preserve accuracy, we consider the probability of the parent class C for a given value (true or false) to be equal to 0.5, which means that there is a 50% probability that the concept represented by the class C is valid and 50% otherwise. Referring to our motivating scenario, the Face Identification template contains the MA Boolean expression with two multimedia functions f1 and f2. The parent class C represents the concept of detecting the presence of Bob in the captured images. Let us assume now that f1 returns a confidence score of 80% and f2 a confidence score of 60%. As mentioned before, the probability of the given concept
is equal to 0.5 for both binary values. The conditional probability distribution P(f1 | C) provided in Figure 6.2 is described as follows: when the concept is valid, P(f1 = T | C = T) = 0.8, which is the result returned by the multimedia function, and P(f1 = T | C = F) = 0.2 otherwise. The same computation is applied to P(f2 | C).
Fig. 6.2 A BN applied example.
Finally, the computed result is given by:

BN({f1(Mo1, Mo1'), f2(Mo1, Mo2')}, {0, 0})
= P(f1 = T, f2 = T | C = T) P(C = T) / P(e)
= P(f1 = T, f2 = T | C = T) P(C = T) / Σ_{b=T}^{F} P(f1 = T, f2 = T | C = b) P(C = b) = 0.85    (6.6)
However, the BN function described above calculates a resulting value without taking into consideration the uncertainty threshold specified for each multimedia function, whereas integrating the uncertainty threshold decreases the possibility of unauthorized access. In the BN function, ε is deducted from the observed values of the observed nodes (see Figure 6.3). Now, let us consider the same motivating example with an uncertainty threshold equal to 0.1. After applying Bayes' rule, the BN function computes the following value:

BN({f1(Mo1, Mo1'), f2(Mo1, Mo2')}, {0.1, 0.1})
= P(f1 = T, f2 = T | C = T) P(C = T) / Σ_{b=T}^{F} P(f1 = T, f2 = T | C = b) P(C = b) = 0.75    (6.7)
Fig. 6.3 A BN with deduced uncertainty threshold.
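The following sketch is a compact, assumption-laden reimplementation of this naive-BN aggregation (function and parameter names are ours): the likelihood of each observation given C = False is taken as the complement of the observed score, which reproduces the figures of Eqs. (6.6) and (6.7) up to rounding.

def bn_aggregate(scores, epsilons, prior=0.5):
    """Naive-Bayes aggregation of Section 6.5.2: each multimedia function f_i is an
    observed child of the class C with P(f_i = T | C = T) = s_i and
    P(f_i = T | C = F) = 1 - s_i, where s_i is the returned score after the
    uncertainty threshold has been deducted from it."""
    observed = [s - s * e for s, e in zip(scores, epsilons)]
    p_true, p_false = prior, 1.0 - prior
    for s in observed:
        p_true *= s            # P(f_i = T | C = T)
        p_false *= 1.0 - s     # P(f_i = T | C = F)
    return p_true / (p_true + p_false)

print(round(bn_aggregate([0.8, 0.6], [0.0, 0.0]), 2))   # 0.86 (reported as 0.85 in Eq. 6.6)
print(round(bn_aggregate([0.8, 0.6], [0.1, 0.1]), 2))   # 0.75, as in Eq. 6.7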
6.5.3 "Dempster and Shafer"-Based Function
The Dempster and Shafer function is based on the mathematical theory of Dempster and Shafer, which is used to calculate the probability of an event given a set of evidences. In this section, we provide an overview of the function and of its adaptation to fit our objectives.
1. Frame of discernment (τ): represents the set of elements in which we are interested. In our approach, τ = {True, False}, where the values True and False represent a result for a given fact (e.g. Bob is detected). Given the elements in τ, 2^τ denotes all possible propositions that could describe τ, represented as P(τ) = {∅, {True}, {False}, {True, False}}.
2. Mass function (m): can be compared with a degree of confidence in an element. It is a basic probability assignment belonging to [0, 1] which defines a mapping of the power set, where 1 stands for total confidence and 0 for no confidence at all.
3. Dempster's rule of combination: is used for gathering information to meaningfully summarize and simplify a corpus of data, whether the data come from a single source or from multiple sources. In our case, input multimedia objects can be acquired from one source or several sources. For this reason, each result returned by a multimedia function and related to an input multimedia object is considered as evidence with a calculated confidence. For instance, consider that the multimedia functions f1 and f2 return a related result for each captured instance image (e.g. f1 → 0.5T and f2 → 0.8T), which are considered as two different results for the same fact (detecting the presence of Bob). Given two different evidences (representing the different thresholds calculated for the same fact) supporting a certain proposition A, the rule combines them into one body of evidence. Thus, the rule determines a measure of agreement between the two evidences using:
m12(A) = (m1 ⊗ m2)(A) = Σ_{B∩C=A} m1(B) m2(C) / (1 − Σ_{B∩C=∅} m1(B) m2(C)), when A ≠ ∅    (6.8)
In our approach, we use the combination rule specified above to aggregate the results returned by the multimedia functions in order to obtain one representative threshold; the multimedia predicate is then evaluated against this returned threshold. Let us go back again to our motivating scenario, in which we wish to detect the presence of Bob. For this purpose, the system uses a Face Identification template with an MA holding:
• the combination rule of the Dempster-Shafer theory of evidence (DS) as aggregation function, with an uncertainty threshold equal to 0.1;
• the two multimedia functions f1 and f2 described above, used to analyze multimedia objects;
• the predefined multimedia objects Mo1, Mo2 representing Bob.
Let us assume now that f1 returns a confidence score of 80% according to the captured snapshot Mo1', and f2 a confidence score of 60% according to the captured snapshot Mo2'. These scores represent the probability of validating the concept of detecting Bob in the snapshot images. They are related to the mass functions (mi) defined in the DS theory of evidence. For instance, in our case m1(true) holds the probability of detecting Bob determined using the first function, whereas m2(true) holds the same concept determined with the second function. According to the DS combination rule, we can compute the aggregated confidence score as follows:

m12(true) = [m1(true) × m2(true) + m1(true) × m2({true, false}) + m1({true, false}) × m2(true)] / (1 − K)    (6.9)
where K = m1(true) × m2(false) + m1(false) × m2(true) represents the conflict in the combination rule. In order to reflect the uncertainty when dealing with multimedia objects, the uncertainty threshold predefined for the combination of the multimedia functions f1 and f2 in the aggregation function is deducted from the probabilities of the concept to be determined:
• m1(true) = m1(true) − m1(true) × ε = 0.8 − 0.8 × 0.1 = 0.72
• m2(true) = m2(true) − m2(true) × ε = 0.6 − 0.6 × 0.1 = 0.54
However, if m12(false) were being calculated (the fact that Bob is not detected in the captured image), the threshold would be deducted from m1(false) and m2(false). The remaining mass is assigned to the ignorance set:
• m1({true, false}) = 1 − (m1(true) + m1(false)) = 0.08
• m2({true, false}) = 1 − (m2(true) + m2(false)) = 0.06
After applying the combination rule, we obtain the aggregated confidence value m12(true) = 0.786 which, with respect to the previous parameters and assumptions, leads the application to consider that Bob is not identified within the provided snapshot, as this value is below the acceptance threshold defined in the predicate.
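A compact sketch of this two-source combination is given below; the function name and parameter layout are ours, and the treatment of the uncertainty threshold follows the deductions shown above. It reproduces the example's value up to rounding.

def ds_aggregate(m1_true, m1_false, m2_true, m2_false, epsilon=0.0):
    """Two-source Dempster-Shafer combination of Section 6.5.3 over the frame
    {true, false}. The uncertainty threshold is deducted from the 'true' masses,
    and the mass left over is assigned to the ignorance set {true, false}."""
    m1t = m1_true - m1_true * epsilon
    m2t = m2_true - m2_true * epsilon
    m1u = 1.0 - (m1t + m1_false)     # m1({true, false})
    m2u = 1.0 - (m2t + m2_false)     # m2({true, false})
    conflict = m1t * m2_false + m1_false * m2t
    return (m1t * m2t + m1t * m2u + m1u * m2t) / (1.0 - conflict)

# Running example: f1 -> 0.8/0.2, f2 -> 0.6/0.4, uncertainty threshold 0.1
print(round(ds_aggregate(0.8, 0.2, 0.6, 0.4, 0.1), 3))   # 0.787 (reported as 0.786 in the text)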
6.5.4 Decision Tree-Based Function
Decision trees (DT) are one of the supervised machine learning techniques based on logical methods for inferring classification rules. The most well-known decision tree induction algorithms for statistical uncertainty are ID3 [9] and its successor C4.5 [25]. Decision trees provide a solution applicable to situations involving cognitive uncertainty, in particular vagueness and ambiguity. In the following, we show how we adapted the approach proposed by Y. Yuan and M. Shaw in [31], considered a predecessor of many other works related to decision trees with fuzzy concepts. To help explain this adaptation, we illustrate each definition with our motivating example.
The input of the DT function is a case u belonging to a universe U. In our example, we have two cases: u1 is the first input, where (True = 0.8, False = 0.2), and u2 is the second input, where (True = 0.6, False = 0.4). Each ui is described by a set of attributes A = {A1, ..., Ak}. In our example, we have only one attribute, which is the Source providing the values (let us say that a source can simply be a multimedia function). Each Ai has various linguistic terms T = {T1, ..., Tk}, where T(Aj) is a possible value for the attribute Aj. Here, T has two values, True and False, each with a certain percentage. Finally, each case will be classified into a class C = {C1, ..., Cj}, which is the final result of the DT function, also called the decision attribute. So:

A = {Source1}
T(Source1) = {True, False}    (6.10)
C = {True, False}
We also use the function ρ to represent the membership degree of an object to an attribute; it returns a value in the interval [0, 1]. In our example, ρ gives the values associated with the terms (True, False) by the source (e.g. True = 0.6, False = 0.4).

              Source              C
U\T           True    False       True    False
u1            0.8     0.2         0.8     0.2
u2            0.6     0.4         0.6     0.4

Table 6.1 Sample Data Set.
Consider the sample data set provided in Table 6.1, where n input values from a source represent how much Bob (the person to be authenticated) has been detected in the captured images using a multimedia function. With the first input, Bob is detected with 80% confidence, and with the second one with 60%. The values of the column C are the average of the T terms over all the attributes A. In this scenario, we have only one attribute, "Source", which is why the
True value of the decision attribute C is the same as the True value of the attribute Source. If we had two sources, say Source1 and Source2, with a value of 80% for True in Source1 and 60% for True in Source2, the True value of the decision attribute would be the average of both sources, which is 70%. Drawing the DT is not important in our adaptation: we only need to compute the ambiguity after each input. For the lowest ambiguity reached, we retrieve its values for the True and False attributes and use them for the final decision. To do so, we need to define the classification ambiguity G(P), with fuzzy partitioning P. It is computed as the weighted average of the classification ambiguities over the subsets of the partition, as follows:

G(P) = ∑_{i=1}^{k} w(Ei) × G(Ei)
where:
• G(Ei) is the classification ambiguity with fuzzy evidence Ei,
• w(Ei) is the weight that represents the relative size of the subset Ei:

w(Ei) = M(Ei) / ∑_{j=1}^{k} M(Ej)
where M is the cardinality measure (or sigma count) of a fuzzy set A, defined by M(A) = ∑_{u∈U} µA(u), which measures the size of A. For instance, G(Source) of our sample data set provided in Table 6.1 is computed as follows:
• After the first input: G(Source) = w(True) × G(True) + w(False) × G(False) = 0.8 × 0.17 + 0.2 × 0.69 = 0.27
• After the second input: G(Source) = w(True) × G(True) + w(False) × G(False) = 0.7 × 0.29 + 0.3 × 0.69 = 0.41
As we can see, the ambiguity value after the second input is higher than after the first one; thus we select the values of the lowest ambiguity and consider as final result a detection confidence score of 80%. The adaptation of the fuzzy decision tree to our approach allows us to select the input that engenders the least ambiguity. This can be extended to deal with many inputs coming from many sources: the most trusted source and the best input are selected by retrieving the least ambiguous source and the least ambiguous input value.
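To make the ambiguity computation concrete, the following small Python sketch (our own illustration, not the authors' code) reproduces the two values above; the subset ambiguities G(True) and G(False) are taken as given, as in the text, since their derivation follows Yuan and Shaw's method [31].

def cardinality(memberships):
    # Sigma count M(A): the sum of the membership degrees of a fuzzy set.
    return sum(memberships)

def classification_ambiguity(partition):
    # partition maps each fuzzy evidence Ei to a pair
    # (membership degrees of the cases seen so far, ambiguity G(Ei)).
    total = sum(cardinality(ms) for ms, _ in partition.values())
    return sum((cardinality(ms) / total) * g for ms, g in partition.values())

# After the first input u1 (memberships from Table 6.1, G values as in the text):
print(round(classification_ambiguity(
    {'True': ([0.8], 0.17), 'False': ([0.2], 0.69)}), 2))            # -> 0.27

# After the second input u2 the weights become 0.7 and 0.3:
print(round(classification_ambiguity(
    {'True': ([0.8, 0.6], 0.29), 'False': ([0.2, 0.4], 0.69)}), 2))  # -> 0.41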
6.6 Experimentation
In this section, we present a set of experiments elaborated to study the impact of using different aggregation functions and to show how uncertainty can be reduced. The experiments were conducted on a PC with an Intel Pentium M 1.73 GHz processor and 1 GB of RAM. We plugged in a webcam with a resolution of 1.3 Megapixels in video mode and 5 Megapixels in image mode to capture real-time snapshots. The experiments were made in one of our laboratory rooms under various lighting conditions. Thus, we defined three environment profiles:
• P1: representing a maximally lighted environment,
• P2: representing a normally lighted environment,
• P3: representing a minimally lighted environment.
The objects used to conduct the set of experiments are human faces, random objects, and laboratory rooms representing locations. We used the same multimedia functions described in our motivating scenario in Section 6.1:
• f1: related to the InterMedia Oracle module, used for location and object identification. The predefined images needed by f1 are stored in an Oracle 10g database.
• f2: an SVM classifier based on color object recognition, used for face identification and object identification.
The set of aggregation functions we used is:
• DS: the Dempster and Shafer-based function,
• BN: the Bayesian Network-based function,
• DT: the Decision Tree-based function,
• Avg: the Average-based function,
• Min and Max: functions returning the minimum or maximum value from the set of results of the multimedia functions.
The conducted experiments are divided into 3 steps:
1. Aggregation Function Accuracy: calculating the accuracy and processing time of each of the aggregation functions used.
2. Value Distribution: showing the evolution of the aggregation function results for a set of manually generated values with different distributions.
3. Template Tuning: finding the appropriate template under variable environmental conditions. In this test, we process the captured images under the different profiles (P1, P2, P3) using several templates in order to determine the appropriate one in each profile.
In the following, the results returned by the templates refer to the results of the multimedia attribute expression before its comparison with the predefined threshold. In other terms, they represent the results returned by the aggregation function µ(f1(MOk, MOl), . . . , fn(MOk, MOl), ε1, . . . , εn) → αµB.
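To make the role of this aggregation step concrete, the following sketch shows how a template's scores could be routed to one of the aggregation functions and compared with the predicate threshold. The code and names are our own illustration, and only the trivial aggregators are shown; DS, BN and DT would be registered in the same way.

from statistics import mean

# Hypothetical registry keyed by the names used in the experiments.
AGGREGATORS = {
    'Min': lambda scores, eps: min(scores),
    'Max': lambda scores, eps: max(scores),
    'Avg': lambda scores, eps: mean(scores),
    # 'DS', 'BN' and 'DT' would be added here in the same way.
}

def evaluate_template(scores, aggregator, eps, threshold):
    # Aggregate the confidence scores returned by the multimedia
    # functions and compare the aggregated value with the predicate's
    # acceptance threshold.
    value = AGGREGATORS[aggregator](scores, eps)
    return value, value >= threshold

# Example: two multimedia functions returned 0.8 and 0.6; accept at >= 0.7.
print(evaluate_template([0.8, 0.6], 'Avg', eps=0.1, threshold=0.7))  # (0.7, True)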
6.6.1 Aggregation Function Accuracy and Time Processing
When processing multimedia objects in Ambient Intelligence environments, unpredictable factors related to the context and to human behavior can affect the processing result and lead to false positive or false negative decisions, frustrating in both cases. With the use of the uncertainty resolver and some aggregation function(s), we intend to minimize the probability of making faulty decisions by relying on several sources or captured snapshots, with respect to the processing time. Once we get the set of scores returned by the multimedia functions, the system should interpret and aggregate these scores adequately into one single relevant score. The aim of this test is to study the effect of integrating aggregation functions to reduce faulty decisions. We used the multimedia function f1 and a set of learned images representing different objects, one of which is the face of Bob (the person to detect). In the first part of the test, we captured one snapshot of Bob 10 times (without invoking the aggregation functions), whereas in the second part we repeated 10 times the capture of 5 snapshots of Bob with different behaviors and filtered them using the different aggregation functions. The accuracy of the aggregation functions is determined by comparing the aggregated results returned after 5 snapshots of Bob with the results of one snapshot of Bob obtained without invoking the aggregation functions. The obtained results are shown in Table 6.2 and Table 6.3.

Test1   Test2   Test3   Test4   Test5   Test6   Test7   Test8   Test9   Test10
0.702   0.364   0.609   0.514   0.609   0.562   0.603   0.512   0.484   0.517

Table 6.2 Values returned by multimedia functions without aggregation.
      Test1   Test2   Test3   Test4   Test5   Test6   Test7   Test8   Test9   Test10  Accuracy
Min   0.601   0.649   0.596   0.599   0.618   0.566   0.496   0.554   0.994   0.663   70%
Max   0.748   0.957   0.876   0.882   0.918   0.680   0.575   0.686   0.64    0.704   90%
Avg   0.676   0.974   0.923   0.927   0.950   0.648   0.541   0.615   0.731   0.692   80%
DS    0.986   0.704   0.607   0.620   0.629   0.665   0.575   0.686   0.795   0.744   80%
BN    0.976   0.704   0.633   0.637   0.635   0.956   0.698   0.915   0.996   0.693   100%
DT    0.748   0.602   0.553   0.556   0.585   0.665   0.575   0.686   0.795   0.744   70%

Table 6.3 Values filtered after 5 captured snapshots for each test.
According to the tables above, and considering the fact that Bob appeared in several captured snapshots, we calculate the accuracy for each aggregation function i as follows:
Accuracy(i) = (1/n) × ∑_{j=1}^{n} Pj(Table2(j) < Table3(i, j))     (6.11)
where:
• Table2(j) is the content of the jth column of Table 6.2, and Table3(i, j) is the content of the ith row and jth column of Table 6.3,
• Pj(Table2(j) < Table3(i, j)) = 1 if Table2(j) < Table3(i, j), and 0 otherwise     (6.12)
• n is the number of tests elaborated (a computational sketch of Eqs. (6.11)-(6.12) is given after this paragraph).
All of the aggregation functions returned relatively high accuracy scores, which demonstrates their ability to minimize the risk of false decisions when integrated in a decision making system. In addition, Table 6.3 shows that BN provides the maximum accuracy value; this means that, according to the set of tests elaborated here, it is the most accurate aggregation function, although this can of course change in other contexts. In addition to the accuracy tests, we studied the processing time of each aggregation function when integrated in a decision making system. The objective of this test is to show the influence of the aggregation functions on the overall system performance. The set of input images was processed using the multimedia function f2. Figure 6.4 shows the results (in ms) for an increasing number of input images, from 2 to 11, tested on each aggregation function.
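The following minimal sketch (our own illustration) implements Eqs. (6.11)-(6.12) directly on the values of Table 6.2 and one row of Table 6.3.

def accuracy(no_aggregation, aggregated):
    # A test counts as a success when the aggregated score exceeds the
    # score obtained from a single snapshot without aggregation.
    wins = sum(1 for base, agg in zip(no_aggregation, aggregated) if agg > base)
    return wins / len(no_aggregation)

table_6_2 = [0.702, 0.364, 0.609, 0.514, 0.609,
             0.562, 0.603, 0.512, 0.484, 0.517]
bn_row = [0.976, 0.704, 0.633, 0.637, 0.635,
          0.956, 0.698, 0.915, 0.996, 0.693]
print(accuracy(table_6_2, bn_row))  # -> 1.0, i.e. the 100% reported for BN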
Fig. 6.4 Processing Time Results.
As we can see, the overall results are linear, reflecting the fact that the processing time increases with the number of inputs. The Min and Max functions need the minimum time to perform their aggregation, while the DT aggregation function requires the maximum time (we believe this is related to an implementation/optimization issue that we will solve soon). To conclude the first step of our tests, we can say that the accuracy and processing time of the aggregation functions of our proposed approach are practical.
6.6.2 Value Distribution
In this experimental step, we elaborated a set of tests to observe the behavior of the aggregation functions on randomly generated values between 0 and 1. These values represent the degree of certainty, as a percentage, regarding the recognition and identification of an object, person or location. We detail each set below.
6.6.2.1 Test 1: Values higher than 0.5
In this test, a set of 200 random values in the range between 0.5 and 1.00 was generated. Two different investigations of the 200 generated values were carried out, each time with a different average value, in order to evaluate the aggregation functions while varying the Uncertainty Threshold (UT). In the first investigation, the average of the values was fixed at 0.95 (Figure 6.5). Here, we notice that the Min function decreases linearly as the UT increases; its highest value is reached at UT = 0 and is around 0.5 lower than the values of all the other functions. The Max and the Avg functions decrease moderately while keeping close values. The DS and the BN functions keep the same result (= 1) for all UT until BN reaches UT = 0.4 and DS reaches UT = 0.9, where they collapse dramatically to 0, meanwhile all the others show a linear decrease. The DT function starts with values close to the Max function and goes down moderately until reaching UT = 0.5, where a quick drop makes its values move toward those of the Min function.
Fig. 6.5 Evaluating aggregation functions with an average value = 0.95.
In the second investigation, the average of the values was decreased to 0.75 (Figure 6.6). The Min as well as the Max functions keep the same decline as in the first
investigation, meanwhile the Avg shows the same decline as UT increases. However, we notice here that the overall average of the values is lower than in the previous investigation. The BN converges to 0 at UT = 0.4 (while it converged to 0 at UT = 0.5 in the previous one). The same holds for the DS function, which converges at UT = 0.7 in this investigation (and at UT = 0.9 in the previous one). Also, the DT function starts with values close to the Max function, drops quickly at UT = 0.3, and its values become closer to those of the Min function after UT = 0.4.
Fig. 6.6 Evaluating aggregation functions with an average value = 0.75.
6.6.2.2 Test 2: Values less than 0.5
In this test, a set of 200 random values in the range between 0 and 0.5 was generated. As we can observe in Figure 6.7, the Min, DS and BN functions stay just slightly above 0. For the three other functions, the values plunge regularly to reach the lowest level, 0, at UT = 1. The values of the DT function are close to those of the Avg function, and the Max function has the highest values here.
6.6.2.3 Test 3: Random Values
In this test, random values with no restrictions were generated and used as inputs of the aggregation functions. Two generated sets were used (Figure 6.8 and Figure 6.9). In the first set (Figure 6.8), the most obvious observation is that the DT function starts with a value that increases from UT = 0 to reach its maximum at UT = 0.2 and then decreases regularly.
Fig. 6.7 Evaluating aggregation functions with an average value = 0.25.
Fig. 6.8 Evaluating aggregation functions using low average random values set.
In the second set, having higher average random values (Figure 6.9), all functions except BN were unaffected by this change. In a further test, where 75% of the values are higher than 0.5 (Figure 6.10), the BN function remains constant with a value of 1 for the entire test. The Min function returns, as always, the lowest values. The DS function converges to 0 with a sudden change from 1 to 0 at UT = 0.3.
Fig. 6.9 Evaluating aggregation functions using higher average random values set.
Fig. 6.10 Evaluating aggregation functions where 75% of the values are higher than 0.5 with an average of 0.62.
6.6.2.4 Test 5: 75% of the values are less than 0.5
In this test (Figure 6.11), once the majority of the values become closer to 0, the DS, BN, and Min functions return 0. The DT function shows an important decline when the uncertainty threshold is between 0.1 and 0.2.
Fig. 6.11 Evaluating aggregation functions where 75% of the values are lower than 0.5 with an average of 0.35.
6.6.2.5 Test 6: Equally distributed values
Here again (Figure 6.12), the Min, DS, and BN functions return 0 for the entire test, meanwhile the DT shows for the second time a small rise followed by a slow decline.
Fig. 6.12 Evaluating aggregation functions with equally distributed values (50% lower and 50% higher than 0.5).
6.6.2.6 Test 7: Distribution change
In this test, a set of 200 values was generated. Initially, 0% of the values were less than 0.1. Then, with each iteration, we continuously increased this percentage to study the behavior of the functions at each step (Figure 6.13). Since the maximum value did not vary in this test, the Max function is constant. The Min function behaves as follows: it starts with 0% of the values less than 0.1 and promptly drops down when 10% of the values become less than 0.1. The Avg function decreases slowly with each distribution. The DT function behaves similarly to the Min function when dropping down, but it keeps slightly higher values along the entire test. The DS function shows a sudden change when 25% (precisely between 25% and 26%) of the values become less than 0.1; the same happens with the BN after 25% of the distribution becomes less than 0.1, before it reaches 0 when 30% of the distribution values are less than 0.1.
Fig. 6.13 Aggregation function behavior when distribution values change.
6.6.2.7 Test 8: Influence of the number of returned values 0 and 1 on the aggregated result
The aim of this test is to study the impact of the number of returned values 0 and/or 1 on the decision. Figure 6.14 shows the computation results and the aggregation function behavior, which we explain here:
• Min and Max functions: the Min function returns 0 when at least one "0" exists in the distribution. Similarly, the Max function returns 1 when at least one "1" value appears in the distribution.
• Avg function: the presence of 0 and 1 in the value distribution has little influence on the Avg function.
• DS, BN, and DT functions: here, several cases can be identified:
– if 99% of the values in the distribution are 1 and the remaining 1% are 0 values, then the DS, BN and DT functions output 0 as a result. If this 1% contains any value ≠ 0, the output is always 1.
– if 99% of the values are higher than 0.5 and only one value is 0, then:
· the DS and BN functions return 0,
· the DT function returns 0 when the ambiguity of a successive set of values gets lower than the current lowest ambiguity.
– if 99% of the values are less than 0.5:
· when 33% of the values are 1, the DS function starts returning results higher than 0 and then starts returning values close to 1 at 36%,
· when 36% of the values are 1, the BN function returns positive results and quickly after returns values equal to 1 starting from a percentage of 39%,
· the DT function returns 1 when the ambiguity of a successive set of values gets lower than the current lowest ambiguity.
Fig. 6.14 Influence of the percentage of the 1 value in a distribution.
6.6.2.8 Discussion
Through this step of our experiments, we aimed at studying the evolution and the behavior of the aggregation functions in response to different value distributions
and threshold variations. Based on the computed results, we can pin down several observations. The Min function returns the minimum values across all the experiments. This means that it ignores the variations of the values (we say it is variation insensitive) and can be very useful when maximum security is required in an application. Similarly, the Max function returns the highest value in the distribution, which can be used in several applications to alert and/or prevent users. The DT function behaves similarly to the Min function when the uncertainty threshold starts to increase (UT >= 0.5), and particularly when the values are less than 0.5. However, in distributions where the values are neither too high nor too low, the DT function returns values near the average. This makes the DT function more practical for applications where security requirements are moderate. In addition, an important observation was made in Test 3 about the DT function, which starts with a value that increases from UT = 0 to reach its maximum at UT = 0.2 and then decreases continuously. This shows that when the uncertainty level increases, the returned results do not necessarily decrease; they may increase as well. This is the law of uncertainty. We also concluded that the BN function is the most sensitive to variations. The DS function is similar but less sensitive. This was observed in tests with small value changes, where the DS and BN showed a sudden change. When we proceeded with Test 7, we had initially 0% of the values less than 0.1 and 100% of the values higher than 0.6. Then, we started exchanging the high values with values less than 0.1. When the percentage of values less than 0.1 reaches 25%, the DS function converges to 0, meanwhile the BN needs 30% to converge to 0. This means that these two functions can be used in highly secured applications with very sensitive cameras and powerful hardware. In Test 8, as mentioned earlier, we can mainly observe that for the DS and the BN, a single 0 value in the distribution makes the output result 0 even if all the other values are 1. On the other hand, if all the values are 0 and we replace each 0 with a value of 1, the DS needs 38% of the values to become 1 before the result increases towards 1, meanwhile the BN needs 40% to start increasing.
6.6.3 Template Tuning
The objective of this step is to determine the appropriate template for each profile representing a given environment. This helps the system administrator tune the platform with respect to the different environments that can affect the results of multimedia object processing in an Ambient Intelligence environment. We proceed with the pre-analysis phases needed to start the experiments, namely the learning and template generation phases described below:
1. Learning phase: the system learns to identify a concept representing a human face, an object or a location. This refers to determining the set of predefined
multimedia objects in the different profiles. We specified 3 different sets of predefined images taken in the 3 given profiles: B1, B2, and B3, representing sets of predefined images of a given object captured in P1, P2, and P3 respectively.
2. Template generation: here, we defined 6 different templates (T1, T2, T3, T4, T5, T6) associated with the aggregation functions Min, Max, Avg, DS, BN and DT respectively. The multimedia functions are used alternatively depending on the object to be detected: for location or object identification we used the f2 multimedia function, whereas for face identification we used the f1 multimedia function.
We then proceeded as follows: a number of images are captured using our webcam and processed according to the defined templates holding the multimedia and aggregation functions. The multimedia functions compare the captured images with the set of predefined images stored for each object and return a set of similarity scores, which are then aggregated according to the aggregation function associated with the related template. The appropriate template is determined based on the highest returned value in each profile (assuming that the captured images of the object are predefined for the multimedia function used).
6.6.3.1 Case 1: using the multimedia function f1
The predefined multimedia objects represent a collection of a person's photos learned and assigned to the multimedia function f1. We conducted a series of tests using each time one of the three predefined multimedia objects (B1, B2 and B3) under the 3 different profiles P1, P2 and P3 respectively. The system captures 5 snapshots of the person and processes them using the multimedia function f1 and an uncertainty threshold of 0.1 in P1, P2 and P3. Then f1 returns 5 scores representing the degree of similarity of the captured image with the predefined one. For each profile, we retrieve the highest result after applying the corresponding aggregation function to the 5 scores returned by the multimedia function f1. Table 6.4 shows the results in each profile with the different predefined sets of multimedia objects (B1, B2 and B3). Using B1 in P1 (and similarly B2 in P2 and B3 in P3), the template T4, based on the DS function, returned the highest result, whereas T5, corresponding to the BN function, returned almost the same values as DS with a slight difference.

        B1                        B2                        B3
        P1      P2      P3        P1      P2      P3        P1      P2      P3
T1      0.6     0.06    0.06      0.06    0.6     0.06      0.06    0.06    0.6
T2      0.6     0.6     0.06      0.6     0.6     0.06      0.06    0.06    0.6
T3      0.6     0.49    0.06      0.38    0.6     0.06      0.06    0.06    0.6
T4      0.98    0.80    0         0.14    0.98    0         0       0       0.98
T5      0.88    0.24    0         0.013   0.88    0         0       0       0.88
T6      0.6     0.06    0.06      0.06    0.6     0.06      0.06    0.06    0.6

Table 6.4 Template tuning according to P1, P2 and P3 using f1
6.6.3.2 Case 2: using the multimedia function f2
The predefined object here represents a location associated with the multimedia function f2. As in Case 1, we elaborated a series of tests using each time one of the three predefined multimedia objects (B1, B2 and B3) under the 3 different profiles P1, P2 and P3 respectively. The system captures 5 snapshots of the location and processes them using the multimedia function f2 and an uncertainty threshold of 0.1 in P1, P2 and P3. The results are shown in Table 6.5. The templates T4 and T5 returned the highest values in the 3 different profiles P1, P2 and P3, whereas T1, T2, T3 and T6 returned approximately the same values.

        B1                        B2                        B3
        P1      P2      P3        P1      P2      P3        P1      P2      P3
T1      0.82    0.66    0.04      0.60    0.67    0.8       0.07    0.05    0.68
T2      0.84    0.71    0.72      0.63    0.8     0.59      0.12    0.05    0.69
T3      0.83    0.69    0.18      0.62    0.75    0.58      0.08    0.05    0.68
T4      0.99    0.99    0         0.95    0.99    0.91      0       0       0.98
T5      0.99    0.98    0         0.92    0.99    0.85      0       0       0.98
T6      0.84    0.71    0.04      0.62    0.8     0.59      0.07    0.05    0.69

Table 6.5 Template tuning according to P1, P2 and P3 using f2
6.6.3.3 Uncertainty threshold tuning
In the following, we calculate the uncertainty threshold representing the difference between the returned results in the profiles P1, P2 and P3. For instance, ε = P1 − P2 describes the value that could affect the aggregated result if the captured images are taken in the 2 different profiles P1 and P2. Table 6.6 shows the calculated uncertainty thresholds between the profiles for the specified predefined objects (B1, B2 and B3). Using this uncertainty threshold calculation, one can adapt template usage based on the environment in which multimedia object processing is taking place.

        B1                            B2                            B3
        ε1=P1−P2    ε2=P1−P3          ε1=P2−P1    ε2=P2−P3          ε1=P3−P2    ε2=P3−P1
T1      0.16        0.78              0.07        0.09              0.63        0.61
T2      0.13        0.12              0.17        0.21              0.64        0.57
T3      0.14        0.65              0.13        0.17              0.63        0.6
T4      0           0.99              0.04        0.08              0.98        0.98
T5      0.01        0.99              0.07        0.14              0.98        0.98
T6      0.13        0.8               0.18        0.21              0.64        0.62

Table 6.6 Uncertainty Threshold calculation
6.7 Conclusion
In this chapter, we showed how it is possible to bridge sensing and decision making using predefined templates in ambient intelligence environments. Templates grouping multimedia processing tools facilitate the manipulation of complex multimedia techniques, which makes retrieving knowledge from contextual information an easier task. In addition, we presented our uncertainty resolver, in which we defined a set of adapted aggregation functions based on several probabilistic theories from the literature, in order to reduce the uncertainty raised by multimedia object processing. The aim was to aggregate the set of values returned by several multimedia functions or sources into one relevant score, which reduces the probability of faulty decisions. We also elaborated a set of experiments to demonstrate the efficiency of our aggregation functions when integrated in a real-time processing and image-capturing scenario, and discussed several observations based on the obtained results. In the near future, we intend to extend our approach with several new aggregation functions (e.g., Artificial Neural Networks) in order to bridge the gap between sensing and decision making more effectively. Furthermore, we are currently implementing a prototype involving all the concepts presented here and plan to use it in real distributed application scenarios.
References 1. J. Covington, M., J. Moyer, M., Ahamad, M. : Generalized Role-Based Access Control for Securing Future Applications. 23rd National Information Systems Security Conference, (pp. 16-19). Baltimore, MD, USA (2000) 2. Jean, K., Yang, K., Galis, A. : A Policy Based Context-aware Service for Next Generation Networks. 8th London Communication Symposium. London (2003) 3. Kumar, A., M. Karnik, N., and Chafle, G. : Context Sensitivity in role based access control. Operating Systems Reviews, 36(3), 53–66 (2002) 4. Toninelli, A., Montanari, R., Kagal, L., Lassila, O. : A Semantic Context-Aware Access Control Framework for Secure Collaborations in Pervasive Computing Environments. International Semantic Web Conference (pp. 473-486). Athens, GA, USA: Springer (2006) 5. Wolf, R., Schneider, M. : Context-Dependent Access Control for Web-Based Collaboration Environments with Role-based Approach. MMM-ACNS 2003, pp. 267-278. St. Petersburg, Russia: Springer (2003) 6. Wullems, C., Looi, M., Clark, A. : Towards Context-aware Security: An Authorization Architecture for Intranet Environments. PerCom Workshops, pp. 132-137. Orlando, Florida, USA: IEEE Computer Society (2004) 7. Davis, M., Smith, M., Stentiford, F., Bamidele, A., Canny, J., Good, N., King, A., Janakiraman, R. . Using Context and Similarity for Face and Location Identification. 18th Annual Symposium on Electronic Imaging Science and Technology Internet Imaging. San Jose, California: IS&TSPIE Press (2006) 8. Mitchell, S., Spiteri, M. D., Bates, J., and Coulouris, G.: Context-Aware Multimedia Computing in the Intelligent Hospital. SIGOPS EW2000, the Ninth ACM SIGOPS European Workshop (pp. 13-18). Kolding, Denmark: ACM (2000)
9. Quinlan, J. R.: Induction of Decision Trees. Machine Learning 1, 81–106 (1986)
10. Poole, D.: Logic, Knowledge Representation, and Bayesian Decision Theory. Computational Logic, pp. 70-86. London, UK: Springer (2000)
11. Shafer, G.: A Mathematical Theory of Evidence. Princeton University Press (1976)
12. Dey, A. K., Abowd, G. D., Salber, D.: A Conceptual Framework and a Toolkit for Supporting the Rapid Prototyping of Context-Aware Applications. Human-Computer Interaction 16, 97–166 (2001)
13. Dejene, E., Scuturici, V.-M., Brunie, L.: Hybrid Approach to Collaborative Context-Aware Service Platform for Pervasive Computing. Journal of Computers (JCP), 40–50 (2008)
14. Wang, X., Zhang, D., Gu, T., Keng Pung, H.: Ontology Based Context Modeling and Reasoning using OWL. PerCom Workshops. Orlando, Florida, USA: IEEE (2004)
15. Bhatti, R., Bertino, E., Ghafoor, A.: A Trust-Based Context-Aware Access Control Model for Web-Services. Distributed and Parallel Databases, 83–105 (2005)
16. Hu, J., Weaver, A. C.: A Dynamic, Context-Aware Security Infrastructure for Distributed Healthcare Applications. 5th Workshop on Pervasive Security, Privacy and Trust (PSPT). Boston, MA (2004)
17. J. Covington, M., Fogla, P., Zhan, Z., Ahamad, M.: A Context-Aware Security Architecture for Emerging Applications. ACSAC 2002, pp. 249-260. Las Vegas, NV, USA: IEEE Computer Society (2002)
18. J. Covington, M., Long, W., Srinivasan, S., K. Dey, A., Ahamad, M., D. Abowd, G.: Securing context-aware applications using environment roles. SACMAT 2001, pp. 10-20. Chantilly, Virginia, USA: ACM (2001)
19. Ardagna, C. A., Cremonini, M., Damiani, E., De Capitani di Vimercati, S., Samarati, P.: Supporting location-based conditions in access control policies. ASIACCS 2006, pp. 212-222, Taipei, Taiwan: ACM (2006)
20. IBM: QBIC - DB2 Image Extenders. Retrieved 02 16, from http://wwwqbic.almaden.ibm.com (2008)
21. Nepal, S., V. Ramakrishna, M.: Query Processing Issues in Image (Multimedia) Databases. International Conference on Data Engineering (ICDE), pp. 22-29, Sydney, Australia (1999)
22. Oracle Technology Network: Oracle Multimedia. Retrieved 11 09, from http://www.oracle.com/technology/products/intermedia/index.html (2007)
23. efg's Computer Lab: Image Processing. Retrieved 01 02, from http://www.efg2.com/Lab/Library/ImageProcessing/SoftwarePackages.htm (2008)
24. Java Community Process: Community Development of Java Technology Specifications. Retrieved 01 05, from http://jcp.org/en/jsr/detail?id=135 (2008)
25. Quinlan, J. R.: C4.5: programs for machine learning. Morgan Kaufmann Publishers (1993)
26. Smach, F., Lemaitre, C., Miteran, J., Gauthier, J. P., Abid, M.: Colour Object recognition combining Motion Descriptors, Zernike Moments and Support Vector Machine. IEEE Industrial Electronics, IECON, pp. 3238-3242, Paris, France (2006)
27. Dempster, A. P.: A Generalization of Bayesian Inference. Journal of the Royal Statistical Society, Series B, 30, 205–247 (1968)
28. An, X., Jutla, D., Cercone, N.: Privacy intrusion detection using dynamic Bayesian networks. 8th International Conference on Electronic Commerce: The new e-commerce: innovations for conquering current barriers, obstacles and limitations to conducting successful business on the internet, pp. 208-215, Fredericton, New Brunswick, Canada (2006)
29. Jemili, F., Zaghdoud, M., Ben Ahmed, M.: A Framework for an Adaptive Intrusion Detection System using Bayesian Network. ISI 2007, pp. 66-70, New Brunswick, New Jersey, USA (2007)
30. Chen, Y., Liginlal, D.: Bayesian Networks for Knowledge-Based Authentication. IEEE Transactions on Knowledge and Data Engineering, 695–710 (2007)
31. Yuan, Y., Shaw, M. J.: Induction of fuzzy decision trees. Fuzzy Sets and Systems 69, 125–139 (1995)
Chapter 7
Ambient Intelligence in Multimedia and Virtual Reality Environments for Rehabilitation
Attila Benko and Sik Lanyi Cecilia
Summary. This chapter presents a general overview of the use of multimedia and virtual reality in rehabilitation and in assistive and preventive healthcare. It deals with multimedia and virtual reality applications based on AI, intended for use by medical doctors, nurses, special teachers and other interested persons, and describes how multimedia and virtual reality can assist their work, including how they can help the patients' everyday life and rehabilitation. In the second part of the chapter we present the Virtual Therapy Room (VTR), an application for aphasic patients that was created for practicing communication and expressing emotions in a group therapy setting. The VTR shows a room that contains a virtual therapist and four virtual patients (avatars). The avatars use their knowledge base to answer the questions of the user, providing an AI environment for rehabilitation. The user of the VTR is the aphasic patient, who has to solve the exercises. The picture relevant to the actual task appears on the virtual blackboard. The patient answers the questions of the virtual therapist; the questions are about pictures describing an activity or an object at different levels. The patient can ask an avatar for the answer. If the avatar knows the answer, its emotion changes to happy instead of sad. The avatar expresses its emotions in different dimensions: its behavior, facial mimics, voice tone and responses change. The emotion system can be described as a deterministic finite automaton whose states are emotion states and whose transition function is derived from the input-response reactions of an avatar. Natural language processing techniques were also implemented in order to establish high-quality human-computer interface windows for each of the avatars. Aphasic patients
Attila Benko, University of Pannonia, Egyetem street 10, H-8200 Veszprem, Hungary, http://www.uni-pannon.hu, e-mail: [email protected]
Sik Lanyi Cecilia, University of Pannonia, Egyetem street 10, H-8200 Veszprem, Hungary, http://www.uni-pannon.hu, e-mail: [email protected]
are able to interact with the avatars via these interfaces. At the end of the chapter we outline possible future research directions.
Key words: Virtual Reality, Rehabilitation, Virtual Therapy Room
7.1 Introduction
As we approach the end of the first decade of the 21st century, we can see that the alarming predictions made by demographers during the 1990s are coming true: the inversion of the population pyramid is no longer a hypothesis but a fact, and in Europe, for example, the share of people older than 65 years will soon reach 20% or even more [16]. There is a correlation between ageing and disability [1]. Manton has investigated the impact of age-specific disability trends on long-term care needs [11]. Japan suffers from an extreme shortage rather than an undersupply of nursing home beds, and the Government has planned to vastly expand facilities under its Gold Plan, with some extensions in 1998 (the same would hold for Korea). If some de facto deinstitutionalisation has occurred in this country, it might reflect institutional disequilibria in this market rather than a clear trend that can be projected into the future [15]. The common goal of these countries with ageing populations is to preserve the Activities of Daily Living (ADL). The rapid ageing of the population in OECD countries over the next few decades is expected to increase the demand for, and hence expenditure on, long-term care services. One factor that might help mitigate this pure demographic effect of population ageing on the demand for long-term care would be steady improvements in the health and functional status of people aged 65 and over, which would enable them to live independently as long as possible [17]. This generation needs rehabilitation using assistive technology and monitoring systems, and most of these systems are based on multimedia technology with Ambient Intelligence (AI). The other reason why we need these systems is that the present middle-aged user group, now using the computer for work or entertainment, will soon move into old age [27]. At that time they will feel more secure using a well-known environment such as the computer. It is time to realize the problem and prepare for the solution: we should keep in mind today what we will experience when we grow old, and we should design now a world that will help us in the future! According to the Census 2000 definition, the types of disability are: visually impaired, partially sighted, deaf, hard of hearing, mentally retarded, and physically disabled users [3]. Therefore, in the following sections we focus on the problems of these special needs users.
7.2 Using AI by special needs users
Based on the profile above, special needs users are: visually impaired, partially sighted, deaf, hard of hearing, mentally disabled, and physically disabled users. The ageing population sometimes has several disabilities at once, but its biggest challenges are living independently at home as long as possible, active ageing, and the rehabilitation of several disabilities. In this section we show some examples of how AI is used to help them.
7.2.1 Visual Impairment and Partially Sighted People
It is very important for developers to keep in mind that visually impaired and partially sighted people do not have perfect vision. The degree of perfect vision is 1; a partially sighted person's vision degree is between 0.1 and 0.3. Blind people never receive visual information; they have to obtain it through another information channel, such as sound. The utilization of different media channels can be a useful resource in this case [4]. In general, empirical studies concluded that multimedia content facilitates information acquisition when: (i) the media support dual coding of information (information is processed through one of two generally independent channels); (ii) the media support one another (showing closely related, supportive information); (iii) the media help the user to construct cognitive models; and (iv) media are presented to users with low prior knowledge or aptitude in the domain being treated. An example is using a screen reader as a form of Information and Communication Technology, but in this case the software engineer or web designer has to prepare the information in an accessible format [24]. In other situations of everyday life, more and more user interfaces of household appliances are based on information screens using display panels, but visually disabled persons cannot read them. Gutiérrez and co-workers developed a software system based on machine vision technology. The developed solution, an AI application, runs on commercial mobile devices such as pocket PCs or smartphones: a digital camera of the mobile device captures an image of the display panel, the image is processed in order to detect the numeric, alphanumeric and iconographic information shown by the display, and finally the information is transmitted to the user using text-to-speech technology [7]. An infra-red handheld device was developed in Japan for helping blind people to cross roads: when the user wants to cross, he or she judges the direction to go by utilizing the directivity of the infra-red beam with a hand-held terminal [18]. People with visual disabilities have serious difficulties when moving through the city on the public transportation system. AudioTransantiago, a handheld application, helps users plan trips and provides contextual information during the journey through the use of synthesized voices [22].
7.2.2 Deaf and Hard-of-Hearing People
Deaf people never receive audio information; they have to obtain it through another information channel, such as sign language or written text. Another problem is that people with impaired hearing may have a limited vocabulary. Therefore, new information and instructions must use simple language alongside cartoon-like presentation, and they still require sounds to accompany the graphics; this also applies to anyone with a cognitive impairment [27]. A Web-based interpreter was developed at the University of Tunis; the aim of this tool is to automatically interpret texts into visual-gestural-spatial language by using avatar technology [9]. A prototype based on automatic speech recognition was developed in Spain; it could be applied to automatically generate live subtitles as teletext for Spanish news broadcasts without human participation. The main goal was to evaluate the feasibility of using this technology to improve the quality of life of millions of hearing-impaired people [14]. An interactive training system was developed at Kagawa University: a speech training system for auditorily impaired people employing a talking robot. The talking robot consists of mechanically designed vocal organs such as a vocal tract, a nasal cavity, artificial vocal cords, an air pump and a sound analyzer with a microphone system; the mechanical parts are controlled by 10 servomotors in total for generating human-like voices. The robot autonomously learns the relation between the motor control parameters and the generated vocal sounds by auditory feedback control, in which a self-organizing neural network is employed for the adaptive learning [10].
7.2.3 Physically Disabled Persons
The problem of physically impaired people is not only the use of a wheelchair; they sometimes also have impaired fine motor ability. The biggest problem for people with impaired fine motor ability is using the input devices of Information Technology (IT). Additionally, multimedia software must be accessible via the keyboard; therefore it must be easy to use and have a good keyboard navigation system. Thus the task is to find the optimal navigation method for the mobility-impaired user. If the user does not have a special input device, navigation can be facilitated with a moving rectangle whose speed is adjustable, or through a voice-controlled navigation or command system [23]. In this sense, some efforts have been made, such as that presented in [28], to propose easy and intuitive methodologies and authoring tools for the implementation of generic dialogue systems supporting natural interaction with humans. The L-Exos system, a 5-DoF haptic exoskeleton for the right arm, has been successfully clinically tested in a study involving nine chronic stroke patients with upper limb motor impairments in Italy. The device has proven suitable for robotic arm rehabilitation therapy when integrated with a Virtual Reality (VR) system [6].
7.2.4 Mentally Disabled People
There is a wide variety of cognitive impairments, which can be categorized as memory, perception, problem-solving, and conceptualizing disabilities. Memory disabilities include difficulty obtaining, recognizing, and retrieving information from short-term storage, as well as from long-term and remote memory. Dementia syndrome is one of the major challenges facing the quality of life of older people. Stephen Wey [29] has written on the role of assistive technology in the rehabilitation of people with dementia. Morandell and co-workers made an avatar for dementia patients. The idea behind using an avatar as a Graphical User Interface (GUI) component of a User Interface (UI) for people with Alzheimer's disease is to bind the user emotionally to the system; this is reinforced by making the avatar look like familiar informal or formal caregivers [13]. Photo-realistic talking heads of known persons can thus make GUIs more attractive for people with dementia. A care system was developed by the Japan Advanced Institute of Science and Technology to monitor the behaviour of residents at group homes and to protect them from accidents. This monitoring care system implies not only watching people with dementia but also supporting their autonomy in order to realize adequate dementia care [26]. Standen's study aimed to discover whether repeated sessions playing a computer game involving aspects of decision making, such as collecting relevant information and controlling impulsivity, would improve performance in two non-computer-based tests of decision making [25].
7.2.5 Smart Home
We could continue these examples with other assistive technology devices using AI, for example the nowadays very exciting smart home technology. Smart home technologies are often included as a part of ubiquitous computing. Home technologies have tried to help home inhabitants since their creation, and nowadays, due to the popularization of computational devices, ubiquitous computing is expected to be the revolution that develops smart systems with artificial intelligence techniques [20], [8]. The automation of smart environment systems is one of the main goals of smart home research. Scientists at the University of Sevilla focused on learning user lighting preferences, considering a working field like a standard office. They reviewed the smart environment and device setup, showing a real configuration for test purposes. In their research, suitable machine learning techniques are presented in order to learn these preferences and to suggest the actions the smart environment should execute to satisfy the user's preferences. The proposed machine learning techniques were fed with a database, so a proposal for the vectorization of the data is also described and analyzed [5]. Building and installing a smart home is still rather expensive today, even for Western European countries, but it will be one of the possible future solutions to some problems of the ageing population. In this section we showed some examples of how these innovative systems
using AI help special needs users. Another exciting field is Alternative and Augmentative Communication (AAC) and the rehabilitation of aphasic patients. In the next section we present our AI system for helping to teach aphasic clients to speak.
7.3 A detailed example of using AI in virtual reality for rehabilitation
The term aphasia denotes an acquired communication disorder: an aphasic patient has communication difficulties both orally and in writing. Our aim was to develop a system that supports the aphasic patient's recovery. The created AI system represents a virtual therapy room, where the aphasic patient has to answer the virtual therapist's questions and can obtain help from the virtual patients [2]. There are two main aphasia types; the location and extent of the injury on the brain surface determine the type of aphasia:
• Broca aphasia (motor): non-fluent speech, better speech understanding;
• Wernicke aphasia (sensor): fluent speech, weak speech understanding.
For each of the avatars there has to be an algorithm for processing natural language in order to be able to answer questions; natural language processing is a subfield of artificial intelligence. There also has to be an artificially constructed grammatical system for the communication language, based on the extent of the aphasic patient's knowledge. Many aphasic patients communicate with a severely limited number of words, which is the main reason why the patient has to expand his or her vocabulary by practicing communication with the avatars. The algorithm orders the words of the sentences into grammatical categories by means of parsing. The term parsing comes from the Latin pars orationis: a part of speech is assigned to every single word of a sentence, and the words are grouped into expressions. One way of representing the result of the syntactic analysis is the derivation tree (parse tree, see Figure 7.1). The syntax tree representation is a structure for the connections of the terminals and non-terminals of the current sentence:
• the terminals are the words (see Fig. 7.1, e.g. breakfast),
• the non-terminals are grammar categories (see Fig. 7.1, e.g. V for verb).
The full sentence can be obtained by comparing the character strings letter by letter. It was necessary to create a formal grammar to formalize the communication between the avatars and the patients; the aphasic patient's current speech ability determines how complex the formal grammar should be. The formal grammar is given in Backus-Naur form (BNF), where words are assigned to elements such as nouns, verbs, adjectives and articles. The knowledge base of the avatars can be summarized as shown in the following charts (Figure 7.2 and Figure 7.3). The emotional model is described by a deterministic finite automaton (DFA).
Fig. 7.1 Example for parse tree
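The kind of BNF grammar and parse tree discussed above can be illustrated with the following toy Python sketch; the grammar, lexicon and parser are purely illustrative assumptions of ours and are not taken from the actual therapy vocabulary or implementation.

# A toy fragment of a BNF-style grammar written as Python data.
GRAMMAR = {
    'S':  [['NP', 'VP']],        # sentence -> noun phrase + verb phrase
    'NP': [['Det', 'N'], ['N']],
    'VP': [['V', 'NP']],
}
LEXICON = {'the': 'Det', 'boy': 'N', 'breakfast': 'N', 'eats': 'V'}

def parse(tokens, symbol='S'):
    # Naive top-down parser: returns a parse tree if tokens derive
    # from symbol, otherwise None. Enough for the toy grammar above.
    if symbol in LEXICON.values():                     # terminal category
        if len(tokens) == 1 and LEXICON.get(tokens[0]) == symbol:
            return (symbol, tokens[0])
        return None
    for production in GRAMMAR.get(symbol, []):
        if len(production) == 1:
            sub = parse(tokens, production[0])
            if sub:
                return (symbol, [sub])
        else:                                          # binary production
            for split in range(1, len(tokens)):
                left = parse(tokens[:split], production[0])
                right = parse(tokens[split:], production[1])
                if left and right:
                    return (symbol, [left, right])
    return None

print(parse('the boy eats breakfast'.split()))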
Fig. 7.2 First level task
Fig. 7.3 Second level task
To give an exact description of the emotional states and their transitions, we first need the definition of a deterministic finite automaton; the avatar's finite automaton can then be built upon this definition. The quintuple M = (K, Σ, δ, s, F) is a DFA if K is a finite set of states, Σ is a finite input alphabet, δ is the transition function from K × Σ to K, s is the start state, and F is the finite set of final states. The emotional states of the avatar are K = {neutral, happy, sad, fear, anger, boring}, and initially the avatar is in a state without feelings: s = neutral.
Fig. 7.4 A DFA model for representing the emotion system
Let M := (K = {t0, t1, t2, t3, t4, t5, t6, t7}, Σ = {p0, p1, p2, p3, p4, p5, p6, p7}, δ, s = t0, F = {t4}), where:
• t0: neutral state, the avatar's start state,
• t1: natural language processing state,
• t2: happy state,
• t3: sad state,
• t5, t6, t7: further emotion states, defined like t2 and t3,
• t4: command-executing state,
• p1, . . . , p7: conditions for the state-transition function.
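A minimal Python sketch of this automaton follows; the wiring of the transition table is an illustrative assumption of ours, since the chapter defines the conditions p0, . . . , p7 only abstractly.

STATES = {'t0', 't1', 't2', 't3', 't4', 't5', 't6', 't7'}   # K
START, FINALS = 't0', {'t4'}                                # s and F

DELTA = {                      # a partial, illustrative transition table
    ('t0', 'p0'): 't1',        # user input arrives -> natural language processing
    ('t1', 'p1'): 't2',        # answer known       -> happy
    ('t1', 'p2'): 't3',        # answer unknown     -> sad
    ('t2', 'p3'): 't4',        # respond            -> executing commands
    ('t3', 'p3'): 't4',
}

def run(conditions, state=START):
    # Feed a sequence of conditions p_i to the automaton and report
    # whether it ends in an accepting (command-executing) state.
    # Undefined inputs leave the state unchanged in this sketch.
    for p in conditions:
        state = DELTA.get((state, p), state)
    return state, state in FINALS

print(run(['p0', 'p1', 'p3']))  # -> ('t4', True): the happy path to execution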
In summary, with the above algorithm and representation of the avatars, the recovery of the aphasic patient can be aided by practicing written communication and expressing emotions through interaction with these emotional, communicative avatars. The 3D models of the avatars and of the virtual therapy room were created with Maya and Photoshop (see Figure 7.5 and Figure 7.6). The result is ready-to-use software that is also useful for patients with motor disabilities, because they can practice communication with the avatars at home.
Fig. 7.5 Maya has been used to create the virtual therapy room
Fig. 7.6 Textures for the emotions and the rendered mesh of the avatar for a virtual patient
The virtual therapy room's usability was tested not only by healthy people but also by aphasic patients.
7.4 Future vision
Over the last ten years, the technology for creating virtual humans has evolved to the point where they are no longer regarded as simple background characters, but can begin to serve a functional, interactional role. More recently, seminal research and development has appeared in the creation of highly interactive AI (and natural language capable) virtual human (VH) agents. No longer mere props adding context to a virtual world, these VH agents are being designed to perceive and act in a 3D virtual world, engage in face-to-face spoken dialogues with real people and other VHs in such worlds, and they are becoming more capable of exhibiting human-like emotions [19]. According to Rizzo, VHs with AI-based emotion will be a key helping function of future HCI. Augmenting multimedia and virtual reality programs with devices such as data gloves is still expensive, but with the evolution of the VR game industry more and more new equipment will reach the market and will hopefully become cheaper. In this case, old but usable PCs and game hardware could be used at home or in the clinic for the rehabilitation of disabled people. VR researchers are investigating the possibility of delivering VEs over the Internet. Eventually it will be possible to offer VR services to patients in their homes under the supervision of special teachers, doctors or therapists. In this case the patient and the therapist could go into the same VE at the same time, or, hopefully in the near future, several patients could work collaboratively with their therapist in the same VE at the same time [23]. Some contributions have been developed to provide customizable multimedia and collaborative virtual environments, where a patient is able to interact with his therapist or with other patients by means of communication tools inside the VE, and is also able to explore and acquire information, improving his treatment with the help of the multimedia content [21]. The emotional model presented in Section 7.3, described by a deterministic finite automaton (DFA), will be improved by applying methods and tools based on EFSMs (extended finite state machines) to make explicit the predicates governing the personalization of collaborative work with therapists. For this we extend the common middleware services with domain-specific services by applying the SDL Macro-patterns method from [12], which offers EFSM-based methods for the analysis steps, focusing on reuse and refactoring with design patterns from generic implementation frameworks, and on testability before code generation. In the context of a patterns-integrated architecture, the identification and mapping of roles in the therapists' domain becomes easier, and the concept space of patterns helps in translating and integrating them into the VTR implementation framework. Moreover, the above-mentioned VHs with AI would be useful intelligent systems helping the elderly generation.
7.5 Conclusion
One of the problems of countries with highly developed economies is the ageing population. Unfortunately, there is a correlation between age and health and disabilities. The demographic data show that the problem will become serious in the next 20-30 years. There will not be enough social workers, nurses, caretakers, etc.; therefore, for independent living we will need more assistive technology and IT-based helping systems, most of which are based on AI. We showed some examples of the newest multimedia and VR systems helping special needs users (visually impaired, partially sighted, deaf, hard of hearing, mentally retarded, and physically disabled users, etc.). We also presented our newest research, the Virtual Therapy Room (VTR), an application for aphasic patients created for practicing communication and expressing emotions in a group therapy setting.
7.6 Acknowledgement
The authors would like to thank Dr. Jacqueline Ann Stark and the Austrian Science and Research Liaison Office (project number: 2007.ASO-N/4/5) for their support in the development of the Virtual Therapy Room.
References
1. Azkoitia, J.M.: Ageing, Disability and Technology. Challenges for Assistive Technology, G. Eizmendi et al. (Eds.), IOS Press, pp. 3-7 (2007)
2. Benkő, A., Sik Lányi, C., Stark, J.: Interacting via a Virtual Language Therapy Room. Computer-Based Intervention and Diagnostic Procedures – Applications for Language-Impaired Persons Workshop, 7-8 July 2008, Vienna, Austria, pp. 19-22 (2008)
3. Census 2000. Census Report on Disability: http://www.who.int/healthmetrics/tools/logbook/en/countries/zmb/2000_Census_Report_on_Disability.pdf
4. Encheva, S., Tumin, S., Sampaio, P.N.M., Rodriguez Peralta, L.M.: On Multimedia Factors Effecting Learning. Proc. of ED-MEDIA 2007 World Conference on Educational Multimedia, Hypermedia & Telecommunications, Canada, June (2007)
5. Fernández-Montes, A., Ortega, J.A., Álvarez, J.A., Cruz, M.D.: Smart Environment Vectorization: An Approach to Learning of User Lighting Preferences. In: Lovrek, I., Howlett, R.J., Jain, L.C. (Eds.) KES 2008, Part I, LNAI 5177, Springer-Verlag, Berlin Heidelberg, pp. 765-772 (2008)
6. Frisoli, A., Bergamasco, M., Borelli, L., Montagner, A., Greco, G., Procopio, C., Carboncini, M.C., Rossi, B.: Robotic assisted rehabilitation in virtual reality with the L-EXOS. Proc. of 7th ICDVRAT with ArtAbilitation, Maia, Portugal, pp. 253-260 (2008)
7. Gutiérrez, J.A., González, F.J., Picón, A., Isasi, A., Domínguez, A., Idigoras, I.: Electronic Display Panel Mobile Reader. Challenges for Assistive Technology, G. Eizmendi et al. (Eds.), IOS Press, pp. 320-325 (2007)
8. Hangos, K.M., Lakner, R., Gerzson, M.: Intelligent Control Systems: An Introduction with Examples. Kluwer Academic Publishers, pp. 1-301 (2001)
9. Jemni, M., Elghoul, O.: An Avatar Based Approach for Automatic Interpretation of Text to Sign Language. Challenges for Assistive Technology, G. Eizmendi et al. (Eds.), IOS Press, pp. 266-270 (2007)
10. Kitani, M., Hayashi, Y., Sawada, H.: Interactive training of speech articulation for hearing impaired using a talking robot. Proc. of 7th ICDVRAT with ArtAbilitation, Maia, Portugal, pp. 293-301 (2008)
11. Manton, K.G.: The Scientific and Policy Needs for Improved Health Forecasting Models for Elderly Populations. In: Manton, K.G., Singer, B.H., Suzman, R.M. (Eds.) Forecasting the Health of Elderly Populations, Springer Verlag, pp. 3-35 (1993)
12. Medve, A., Ober, I.: From Models to Components: Filling the Gap with SDL Macro-patterns. Proc. of IEEE International Conference on Innovation in Software Engineering (ISE 2008), Ed. Mohammadian, M., 10-12 December, Vienna, Austria (2008)
13. Morandell, M., Fugger, E., Prazak, B.: The Alzheimer Avatar – Caregivers' Faces Used as GUI Component. Challenges for Assistive Technology, G. Eizmendi et al. (Eds.), IOS Press, pp. 243-247 (2007)
14. Obach, M., Lehr, M., Arruti, A.: Automatic Speech Recognition for Live TV Subtitling for Hearing-Impaired People. Challenges for Assistive Technology, G. Eizmendi et al. (Eds.), IOS Press, pp. 286-291 (2007)
15. Maintaining Prosperity In An Ageing Society: the OECD study on the policy implications of ageing – Long term care services to older people: a perspective on future needs. http://www.oecd.org/dataoecd/21/43/2429142.pdf
16. OECD population pyramids in 2000 and 2050: http://www.oecd.org/LongAbstract/0,3425,en_2649_33933_38123086_1_1_1_1,00.html
17. Trends in Severe Disability Among Elderly People (Health Working Paper No. 26): http://www.oecd.org/dataoecd/13/8/38343783.pdf
18. Ohkubo, H., Kurachi, K., Takahara, M., Fujisawa, S., Sueda, O.: Directing Characteristic of Infra-red Handheld Device toward Target by Persons with Visual Impairment. Challenges for Assistive Technology, G. Eizmendi et al. (Eds.), IOS Press, pp. 336-340 (2007)
19. Rizzo, A.A.: Virtual reality in psychology and rehabilitation: the last ten years and the next! Proc. of 7th ICDVRAT with ArtAbilitation, Maia, Portugal, pp. 3-8 (2008)
20. Russell, S., Norvig, P.: Artificial Intelligence: A Modern Approach. Prentice-Hall Inc., pp. 1-932 (1995)
21. Sampaio, P.N.M., de Freitas, R.I., Cardoso, G.N.P.: OGRE-Multimedia: An API for the design of Multimedia and Virtual Reality Applications. In: Proceedings of the 12th International Conference on Knowledge-Based and Intelligent Information & Engineering Systems (KES 2008), Lecture Notes in Computer Science, Springer-Verlag, Zagreb, Croatia, September 3-5 (2008)
22. Sánchez, J.H., Oyarzún, C.A.: Mobile audio assistance in bus transportation for the blind. Proc. of 7th ICDVRAT with ArtAbilitation, Maia, Portugal, pp. 279-286 (2008)
23. Sik Lányi, C.: Multimedia medical informatics system in healthcare. In: Ichalkaranje, A., et al. (Eds.) Intelligent Paradigms for Assistive and Preventive Healthcare, Springer-Verlag, Berlin, pp. 39-91 (2006)
24. Sik Lányi, C., Forrai, S., Czank, N., Hajgató, A.: On Developing Validator Software Xvalid for Testing Home Pages of Universal Design. Universal Access in HCI, Part I, HCII 2007, Lecture Notes in Computer Science, LNCS 4554, pp. 284-293 (2007)
25. Standen, P.J., Rees, F., Brown, D.: Effect of playing computer games on decision making in people with intellectual disabilities. Proc. of 7th ICDVRAT with ArtAbilitation, Maia, Portugal, pp. 25-32 (2008)
26. Sugihara, T., Nakagawa, K., Fujinami, T., Takatsuka, R.: Evaluation of a Prototype of the Mimamori-care System for Persons with Dementia. In: Lovrek, I., Howlett, R.J., Jain, L.C. (Eds.) KES 2008, Part II, LNAI 5178, Springer-Verlag, Berlin Heidelberg, pp. 839-846 (2008)
27. Sik Lányi, C.: Multimedia Software Interface Design for Special Needs Users. Encyclopedia of Information Science and Technology, 2nd Edition, IGI Global (2008)
28. Quintal, M.M.L., Sampaio, P.N.M.: A Methodology for Domain Dialogue Engineering with the Midiki Dialogue Manager. In: Text, Speech and Dialogue – Proceedings of the 10th International Conference TSD 2007, Lecture Notes in Computer Science, Springer-Verlag, Plzen, Czech Republic, September (2007)
29. Wey, S.: One size does not fit all: person-centred approaches to the use of assistive technology. In: Marshall, M. (Ed.) Perspectives on Rehabilitation and Dementia, Jessica Kingsley Publishers, London, pp. 202-208 (2006)
Chapter 8
Artificial Neural Networks for Processing Graphs with Application to Image Understanding: A Survey Monica Bianchini and Franco Scarselli
Summary. In graphical pattern recognition, each pattern is represented as an arrangement of elements that encodes both the properties of each element and the relations among them. Hence, patterns are modelled as labelled graphs where, in general, labels can be attached to both nodes and edges. Artificial neural networks able to process graphs are a powerful tool for addressing a great variety of real–world problems, where the information is naturally organized in entities and relationships among entities and, in fact, they have been widely used in computer vision, f.i. in logo recognition, in similarity retrieval, and for object detection. In this chapter, we propose a survey of neural network models able to process structured information, with a particular focus on those architectures tailored to address image understanding applications. Starting from the original recursive model (RNNs), we subsequently present different ways to represent images – by trees, forests of trees, multiresolution trees, directed acyclic graphs with labelled edges, general graphs – and, correspondingly, neural network architectures appropriate to process such structures. Key words: Neural Networks, Image Understanding
8.1 From flat to structural Pattern Recognition
Pattern recognition algorithms and statistical classifiers, such as neural networks or support vector machines (SVMs), can deal with real–life noisy data in an efficient
way, so that they can be successfully applied in several domains. Historically, such models were able to process data codified as (sequences of) real vectors of finite and fixed dimensionality. Nevertheless, in a great variety of real–world problems, the process of extracting relevant information cannot disregard the relationships that exist among atomic data, so that applying traditional data mining methods implies an extensive preprocessing phase. For instance, categorical variables are encoded by one–hot encoding, time series can be embedded into finite dimensional vector spaces using time windows, preprocessing of images includes edge detection and the use of various filters, and chemical compounds can be described by topological indices and physiochemical attributes. In all these cases, tree/graph representations can be exploited to represent patterns in more natural ways, whereas significant information is usually lost when complex data structures of arbitrary size are encoded in fixed–dimension flat representations. On the other hand, the truly subsymbolic nature of many of those problems makes it very hard to extract clean structured symbolic data. The way people process this kind of information can be regarded neither as strictly symbolic nor as subsymbolic, neither sequential nor parallel. The human brain is a complex graph of elementary neurons, and the data to be processed can also often be regarded as complex structures of elementary units. In fact, a structured pattern can be thought of as an arrangement of elements that depends deeply both on the interactions among them and on the intrinsic nature of each element. Hence, the causal, hierarchical, and topological relations among parts of a given pattern yield significant information. In the last few years, some new models, which exploit the above definition of a pattern as an integration of symbolic and sub–symbolic information, have been developed. These models try to solve one of the most challenging tasks in pattern recognition: obtaining a flat representation for a given structure (or for each atomic element that belongs to a structure) in an automatic, and possibly adaptive, way. This flat representation, computed following a recursive computational schema, takes into account both the local information associated with each atomic entity and the information induced by the topological arrangement of the elements, inherently contained in the structure. Concerning Markov models, Random Walk (RW) techniques have recently been proposed [1] that can compute a relevance value for each node in a graph, where the relevance depends on the topological information collected in the graph structure and on the information associated with each node. The relevance values computed using an RW model have been used in the past to rank Web pages inside search engines (and Google uses a ranking technique based on a particular RW model). Classical problems related to graph theory, like graph or subgraph matching, can also be addressed in this framework. On the other hand, support vector machines [2, 3] are among the most successful recent developments within the machine learning and data mining communities. The computational attractiveness of kernel methods comes from the fact that they can be applied to high dimensional feature spaces without suffering the high cost of explicitly computing the mapped data. In fact, the kernel trick consists in defining a positive–definite
8 Artificial Neural Networks for Processing Graphs
181
kernel so that a set of data that is not linearly separable can be mapped onto a larger metric space, in which it becomes linearly separable, without explicitly knowing the mapping between the two spaces [4]. Using a different kernel corresponds to a different embedding and thus to a different hypothesis language. Crucial to the success of kernel–based learning algorithms is the extent to which the semantics of the domain is reflected in the definition of the kernel. In recent years, supervised neural networks have also been developed which are able to deal with structured data encoded as labelled directed positional acyclic graphs (DPAGs). These models are called recursive neural networks (RNNs) and are fully described in [5, 6, 7]. The essential idea of recursive neural networks is to process each node of an input graph by a multilayer perceptron, and then to process the DPAG from its leaf nodes toward the root node (if any, otherwise such a node must be suitably added [5]), using the structure of the graph to connect the neurons from one node to another. The output of the neurons corresponding to the root node can then be exploited to encode the whole graph. In other words, in order to process an input DPAG, the RNN is unfolded through the graph structure, extending to graphical structures the traditional “unfolding” process adopted by recurrent neural networks for sequences [8]. The main limitation of this model is inherently contained in the kind of structures that can be processed. In fact, it is not always easy to represent real data using DPAGs. In this kind of graph, each edge starting from a node has an assigned position, and any rearrangement of the children of a node produces a different graph. While such an assumption is useful for some applications, it may sometimes introduce an unnecessary constraint on the representation. For example, this hypothesis is not suitable for the representation of a chemical compound and might not be adequate for several pattern recognition problems. In [9], a new model able to process DAGs was presented. This model exploits a weight–sharing approach in order to relax the positional constraint. Even if interesting from a theoretical point of view, this methodology has limited applications, since the complexity of the network architecture grows exponentially with the maximum outdegree of the processed structures. A different way of relaxing the positional constraint is presented in [10, 11]. This approach assumes that the processed structures are directed acyclic graphs with labels also on the edges (DAGs–LE). The state of each node depends on the label attached to the node and on a particular combination of the contributions of its children, weighted by the edge labels. This total contribution can be computed using a feedforward neural network or an ad hoc function, and is independent of both the number and the order of the children of the node. Therefore, the model allows graphs with any outdegree to be processed. Moreover, since determining useful features that can be associated with the edges of the structures is normally difficult, in [12] a procedure is presented which allows a DPAG to be transformed into a DAG–LE. In order to process cyclic graphs as well, a collapse strategy is proposed in [13], in which each cycle is represented by a unique node that collects the whole information attached to the nodes belonging to the cycle. Unfortunately, this strategy cannot be carried out automatically and it is intrinsically heuristic.
A different technique to process cyclic structures is proposed in [14, 15]. This method
performs a preprocessing of the cyclic structures which transforms each graph into a forest of recursive–equivalent trees. The forest of trees collects the same information contained in the cyclic graph. This method allows both cyclic and undirected graphs to be processed. In fact, undirected structures can be transformed into cyclic directed graphs by replacing each undirected edge with a pair of directed arcs with opposite directions. Finally, the GNN (Graph Neural Network) model [16, 17] is able to process general graphs, including directed and undirected structures, both cyclic and acyclic. In the GNN model, the encoding network can be cyclic and the nodes are activated until the network reaches a steady state. All the models cited above were defined within the supervised learning paradigm. Supervised information, however, either may not be available or may be very expensive to obtain. Thus, it is very important to develop models which are able to deal with structured data in an unsupervised fashion. In the last few years, some RNN models were proposed also in the framework of unsupervised learning [18], and various unsupervised models for non–vectorial data are available in the literature. The approaches presented in [19, 20] use a metric for Self–Organizing Maps (SOMs) that directly works on structures. Structures are processed as a whole by extending the basic distance computation to complex distance measures for sequences, trees, or graphs. Early unsupervised recursive models, such as the temporal Kohonen map or the recurrent SOM, include the biologically plausible dynamics of leaky integrators [21, 22, 23]. This idea has been used to model direction selectivity in models of the visual cortex and for time series representation [22, 23, 24]. From a practical point of view, neural network models able to process structured data have been widely used in computer vision, f.i. in logo recognition [25], for the definition of a similarity measure useful for browsing image databases [26], and for the localization and detection of regions of interest in colored images. In [27, 28], a combination of RNNs for cyclic graphs and RNNs for DAGs–LE was exploited to locate faces, while an extension of the same model was proposed in [10, 29, 30] in order to detect general objects. In all the above cited applications, the fundamental hypothesis is that of codifying images by graphs, in which the labels attached to each node describe the features of a particular region of the image (represented by the node), whereas links account for particular relationships between regions (typically, inclusion or adjacency). The chapter is organized as follows. In the next section, neural network models able to process graphs are presented. In Section 8.3, the graph–based representation of images is described, starting from the segmentation process, and defining several different types of data structures that can appropriately collect the perceptual/topological information extracted from images. Moreover, in the same section, some applications of recursive models to image processing are briefly recalled, in order to give suggestions on how to represent images and how to choose an ad hoc architecture based on the particular image understanding task to be faced. Finally, Section 8.4 collects some concluding remarks.
Fig. 8.1 An image represented by its Region Adjacency Graph: nodes denote homogeneous regions and edges (represented by dashed lines) stand for the adjacency relationship.
8.2 Graph processing by neural networks
In this section, we review some connectionist models for graph processing, with particular attention to those approaches that have been used for image classification and object localization in images, i.e., Graph Neural Networks [16] and Recursive Neural Networks [5, 6].
8.2.1 Notation
In the following, a graph G is a pair (N, E), where N is a set of nodes (or vertexes), and E ⊆ {(u, v) | u, v ∈ N} is a set of edges (or arcs) between nodes. The set ne[n] collects the neighbors of n, i.e. the nodes connected to n by an arc, while co[n] denotes the set of arcs having n as a vertex. Nodes and edges may have labels, which describe the features of the object represented by a node and the features related to the relationships between objects/nodes, respectively. For example, in the case of Fig. 8.1, where the image is represented by a Region Adjacency Graph, node labels may define properties of the regions (e.g., area, perimeter, color, etc.), while edge labels may specify the relative position of the regions (e.g. the distance between their barycenters). The labels attached to node n and to edge (n₁, n₂) will be represented by $l_n \in \mathbb{R}^{l_N}$ and $l_{(n_1,n_2)} \in \mathbb{R}^{l_E}$, respectively. The considered graphs may be either positional or non–positional. A graph is said to be positional or non–positional according to whether a function νn exists
for each node n that assigns to each neighbor u of n a different position νn (u). For instance, in Fig. 8.1, νn may be used to represent the relative spatial position of the regions, e.g., νn (u) may be 1, 2, 3 or 4 according to whether the region represented by u is over, on the right, under, or on the left of the region denoted by n.
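The notation above can be mirrored by a small data structure. The following Python sketch (all field names are illustrative, not taken from the chapter) stores real-valued node and edge labels, the neighbor sets ne[n], the incident arcs co[n], and an optional positional function νn.

# Minimal container mirroring the notation of Section 8.2.1; hypothetical API.
from collections import defaultdict

class LabelledGraph:
    def __init__(self):
        self.node_labels = {}          # n -> list of floats (l_n)
        self.edge_labels = {}          # (u, v) -> list of floats (l_(u,v))
        self.adj = defaultdict(set)    # n -> ne[n]
        self.position = {}             # (n, u) -> nu_n(u), only for positional graphs

    def add_node(self, n, label):
        self.node_labels[n] = label

    def add_edge(self, u, v, label=None, pos_u=None, pos_v=None):
        self.adj[u].add(v)
        self.adj[v].add(u)
        if label is not None:
            self.edge_labels[(u, v)] = label
            self.edge_labels[(v, u)] = label
        if pos_u is not None:          # position of v among u's neighbors
            self.position[(u, v)] = pos_u
        if pos_v is not None:
            self.position[(v, u)] = pos_v

    def ne(self, n):                   # neighbors of n
        return self.adj[n]

    def co(self, n):                   # arcs incident to n
        return [(n, u) for u in self.adj[n]]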
8.2.2 A general framework for graph processing
An intuitive idea supports most of the connectionist approaches to graph processing: a graph represents a set of objects (concepts) and their relationships. More precisely, nodes stand for objects and edges represent their binary relationships. In order to store a representation of the objects, a state $x_n \in \mathbb{R}^s$ is specified for each node n. Moreover, since every concept is naturally defined by its features and the related concepts, we can assume that $x_n$ depends on the information contained in a neighborhood of n. Formally, a parametric function $f_w$, called local transition function, expresses the dependence of a node n on its neighborhood:
$$ x_n = f_w\bigl(l_n,\; l_{co[n]},\; x_{ne[n]},\; l_{ne[n]}\bigr) \qquad (8.1) $$
where $l_n$, $l_{co[n]}$, $x_{ne[n]}$, $l_{ne[n]}$ are respectively the label of n, the labels of its edges, and the states and the labels of the nodes in the neighborhood of n (see Fig. 8.2). Finally, an output $o_n$ may also be defined, which depends on the node state and the node label, according to a parametric local output function $g_w$:
$$ o_n = g_w(x_n, l_n) \qquad (8.2) $$
Thus, Eqs. (8.1) and (8.2) specify a parametric model that computes an output $o_n = \varphi_w(G, n)$ for any node n of the graph G, considering all the information in G. Interestingly, the functions $\varphi_w$ that can be implemented in this way are not significantly restricted by the fact that $f_w$ and $g_w$ can access only information locally available at each node. Actually, it was proved that, under mild assumptions, a large class of continuous functions on graphs can be approximated in probability, up to any degree of precision, by the above model [31]. In this framework, both supervised and unsupervised approaches have been proposed. In the former class of methods, the training set L is defined as a set of triples $L = \{(G_i, n_{i,j}, t_{n_{i,j}}) \mid 1 \le i \le p,\; 1 \le j \le q_i\}$, where each triple $(G_i, n_{i,j}, t_{n_{i,j}})$ denotes a graph $G_i$, one of its nodes $n_{i,j}$ and the desired output at that node, $t_{n_{i,j}}$. Moreover, p is the number of graphs in L and $q_i$ is the number of supervised nodes in graph $G_i$, i.e. the nodes for which a desired target exists. The goal of the learning procedure is that of adapting the parameters w so that $\varphi_w$ approximates the targets on the supervised nodes. In practice, the learning problem is implemented by the minimization of a quadratic error function
$$ e_w = \sum_{i=1}^{p} \sum_{j=1}^{q_i} \bigl(t_{n_{i,j}} - \varphi_w(G_i, n_{i,j})\bigr)^2, \qquad (8.3) $$
Fig. 8.2 A graph and the neighborhood of a node (node 1). The state x1 of node 1 depends on the information contained in its neighborhood.
which can be achieved by a gradient descent technique. In the unsupervised setting, the targets are not available and the training set just consists of a set of graphs L = {Gi | 1 ≤ i ≤ p}. The goal, in this case, is to auto–organize the concepts described by the training set. Such a goal can be achieved by finding a set of parameters such that two nodes $n_{i,k}$, $n_{j,h}$ have close outputs, i.e. $\varphi_w(G_i, n_{i,k}) \approx \varphi_w(G_j, n_{j,h})$, if and only if they represent similar concepts. The transition function $f_w$ and the output function $g_w$ are implemented by static networks, e.g., multilayer perceptrons, cascade correlation networks or self-organizing maps [32]. In this way, Eqs. (8.1) and (8.2) define a large neural network, called the encoding network. In fact, the encoding network has the same topology as the input graph, since it is obtained by substituting all the nodes of G with f–units that compute the function $f_w$. The units are connected according to the graph topology (Fig. 8.3). The f–units calculate the states locally at each node. The information is diffused through the encoding network following the connections defined by the edges. For the nodes where the output is computed, the f–unit is also connected to a g–unit that implements the output function $g_w$. Actually, as clarified below, the
Fig. 8.3 A graph and the corresponding encoding network. The edge directions indicate the dependencies between the data. The computation is carried out at each node by f –units, and the information is diffused according to the graph connectivity. A g–unit computes the output at nodes 2 and 3.
encoding network can be used both to compute the states and to adapt the parameters. Notice that Eq. (8.1) is well suited for positional graphs, since the position of the child u of the node n is naturally encoded by the position of its state $x_u$ in the vector $x_{ne[n]}$ (and of its label $l_u$ in $l_{ne[n]}$). Moreover, if $f_w$ is implemented by a static neural network, then the number of inputs to the network (i.e., the maximum number of children of each node) must be bounded and fixed in advance. When those conditions are too restrictive, it is useful to replace Eq. (8.1) with
$$ x_n = \sum_{u \in ne[n]} h_w\bigl(l_n,\; l_{(n,u)},\; x_u,\; l_u\bigr), \quad n \in N, \qquad (8.4) $$
where $h_w$ is a parametric function which can be realized by a static neural network. This transition function, which has been successfully used in recursive neural networks [10], is not affected by the positions and the number of the children of a node. GNNs, RNNs and other models adopting this framework differ in the set of graphs that they can process, in the implementation of the transition functions and in the learning algorithm they employ. In the following, we review the peculiarities of some of these models.
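As an illustration of how the framework can be instantiated, the following sketch computes the non-positional transition of Eq. (8.4) and the output of Eq. (8.2); the realization of $h_w$ and $g_w$ as single linear-plus-tanh maps in NumPy, the dimensions and the random initialization are assumptions made only for the example (an actual model would learn these parameters).

# Sketch of Eqs. (8.2) and (8.4); parameter shapes and nonlinearity are illustrative.
import numpy as np

rng = np.random.default_rng(0)
S, LN, LE = 4, 3, 2                               # state, node-label, edge-label sizes
Wh = rng.normal(0, 0.1, (S, LN + LE + S + LN))    # parameters of h_w
Wg = rng.normal(0, 0.1, (1, S + LN))              # parameters of g_w

def h_w(l_n, l_nu, x_u, l_u):
    return np.tanh(Wh @ np.concatenate([l_n, l_nu, x_u, l_u]))

def transition(n, neighbors, node_labels, edge_labels, states):
    # Eq. (8.4): x_n = sum over u in ne[n] of h_w(l_n, l_(n,u), x_u, l_u)
    total = np.zeros(S)
    for u in neighbors[n]:
        total += h_w(node_labels[n], edge_labels[(n, u)], states[u], node_labels[u])
    return total

def output(n, node_labels, states):
    # Eq. (8.2): o_n = g_w(x_n, l_n)
    return Wg @ np.concatenate([states[n], node_labels[n]])

# Tiny two-node example with arbitrary labels and edge features
labels = {0: np.ones(LN), 1: np.zeros(LN)}
edge_l = {(0, 1): np.ones(LE), (1, 0): np.ones(LE)}
nbrs = {0: [1], 1: [0]}
states = {0: np.zeros(S), 1: np.zeros(S)}
states[0] = transition(0, nbrs, labels, edge_l, states)
o_0 = output(0, labels, states)

The same functions could equally operate on the LabelledGraph container sketched earlier; dictionaries are used here only to keep the example self-contained.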
8.2.3 Recursive Neural Networks
Recursive neural networks [5, 6] are a supervised model with the following characteristics:
1. The input graph must be acyclic and directed;
2. There is a supersource s from which all the other nodes can be reached; the output is computed only in correspondence with the supersource;
3. The inputs to $f_w$ include only $l_n$ and $x_{ch[n]}$, where ch[n] is the set of the children of n;
4. The functions $f_w$ and $g_w$ are implemented by single layer perceptrons.
Note that Eq. (8.1), without any constraint, may define cyclic dependencies of a state on itself, both because of the presence of cycles in the input graph, and because a state $x_n$ depends on the states $x_{ne[n]}$ of its neighbours, which, in turn, depend on $x_n$. On the other hand, points 1 and 3 exclude those dependencies, so that, in the recursive framework, the encoding networks are feedforward. Thus, the gradient $\partial e_w / \partial w$ can be calculated by a common BackPropagation procedure [33]. More precisely, the states $x_n$ are evaluated following the natural order defined by the edges: first the states of the leaf nodes, then the states of their parents, and so on, until the state of the root is obtained. Then, the output is produced by $g_w$. The gradient computation procedure follows the converse direction and backpropagates the error signal from the supersource to the leaves of the graph. The different contributions to the gradient due to the various replicas of the same weight in the encoding network are then accumulated to appropriately update each weight in the recursive network. It is worth mentioning that, while, in the original version of RNNs [5], $f_w$ and $g_w$ were implemented by single layer feedforward neural networks, different kinds of supervised models have subsequently been exploited. For example, in [13], RNNs with cascade correlation networks have been proposed along with a modified learning algorithm. Moreover, the non–positional implementation of Eq. (8.4) is adopted in [10], where $h_w$ is also realized as an ad hoc combination of the edge labels and the children states.
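The bottom-up evaluation order of RNNs can be sketched as follows: children states are computed first and then propagated toward the supersource, where the only output is produced. The single-layer realization of $f_w$ and $g_w$, the zero-padding of missing children and all dimensions are illustrative assumptions, not taken from the chapter.

# Sketch of the forward pass of an RNN over a DAG/tree with a supersource.
import numpy as np

rng = np.random.default_rng(0)
S, LN, MAX_CH = 4, 3, 2                       # state size, label size, max outdegree
A = rng.normal(0, 0.1, (S, LN + MAX_CH * S))  # f_w weights
C = rng.normal(0, 0.1, (1, S))                # g_w weights

def rnn_forward(children, labels, root):
    """children: node -> ordered list of children; labels: node -> label vector."""
    states = {}

    def state(n):
        if n in states:
            return states[n]
        kids = [state(c) for c in children.get(n, [])]
        kids += [np.zeros(S)] * (MAX_CH - len(kids))   # pad missing positions
        states[n] = np.tanh(A @ np.concatenate([labels[n]] + kids))
        return states[n]

    return C @ state(root)       # output is taken at the supersource only

# Tiny tree: root 0 with children 1 and 2
labels = {0: np.ones(3), 1: np.zeros(3), 2: np.full(3, 0.5)}
children = {0: [1, 2]}
print(rnn_forward(children, labels, 0))

The backward pass (BackPropagation Through Structure) would traverse the same recursion in the opposite direction, accumulating the gradient contributions of all replicas of A and C.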
8.2.4 Graph Neural Networks
Graph neural networks have the following peculiarities:
1. The input graph can be either cyclic or acyclic;
2. The transition and the output function are implemented by multilayered neural networks.
Thus, GNNs can also cope with cyclic dependencies among the node states and do not limit either the graph domain or the parameters of the transition function. Without the constraints of RNNs, Eq. (8.1) may have any number of solutions and the outputs $o_n$ may not be uniquely defined. GNNs use the Banach Fixed–point Theorem to solve such a problem and to ensure the existence and uniqueness of the solution. Let $F_w$ and $G_w$ be the vectorial functions constructed by stacking all the instances of $f_w$ and $g_w$, respectively. Then Eqs. (8.1) and (8.2) become
$$ x = F_w(x, l), \qquad o = G_w(x, l), \qquad (8.5) $$
where l represents the vector collecting all the labels of the input graph, and x contains all the states. The Banach Fixed–point Theorem states that if $F_w$ is a contraction map¹, then Eq. (8.5) has a solution and the solution is unique [34]. In practice, in the GNN model, a penalty term $p(F_w)$, which measures the contractivity of $F_w$, is added to the error function $e_w$. In so doing, the parameters w are forced to remain in the domain where $F_w$ is a contraction map. The Banach Fixed–point Theorem also suggests a method to compute the states. In fact, the theorem establishes that, if $F_w$ is a contraction map, then the states can be simply computed by an iterative application of their definition, i.e. by the following dynamical system
$$ x_n(t) = f_w\bigl(l_n,\; x_{ch[n]}(t-1),\; l_{ch[n]}\bigr), \quad n \in N, \qquad (8.6) $$
where $x_n(t)$ is the t-th iterate of $x_n$. Moreover, the theorem proves that the convergence is exponentially fast and does not depend on the initial state. The computation is stopped when the state change becomes small, i.e. when $\|x(t) - x(t-1)\| \le \varepsilon$ for a vector norm $\|\cdot\|$ and a predefined small real number $\varepsilon$. Finally, in order to design a gradient descent learning algorithm, we can observe that each iteration of Eq. (8.6) corresponds to an activation of the f–units in the encoding network. Actually, in GNNs the encoding network is a system having a settling behavior and, for this reason, the gradient can be computed using the Almeida–Pineda algorithm [35, 36]. In fact, GNNs compute the gradient by a combination of the BackPropagation Through Structure algorithm, adopted by RNNs, and the Almeida–Pineda algorithm. More details on GNNs can be found in [16].
¹ A generic function $\rho: \mathbb{R}^n \to \mathbb{R}^n$ is said to be a contraction map if, for some norm $\|\cdot\|$, there exists a real number $\mu$, $0 \le \mu < 1$, such that $\|\rho(x_1) - \rho(x_2)\| \le \mu\|x_1 - x_2\|$ for all $x_1, x_2 \in \mathbb{R}^n$.
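The state computation of GNNs amounts to iterating the dynamical system until the stopping criterion above is met, as in the following sketch; the linear-plus-tanh transition is only a placeholder for $f_w$ and is simply assumed (rather than enforced through the penalty term, as GNNs do) to be a contraction.

# Sketch of the GNN fixed-point state computation; transition is illustrative.
import numpy as np

rng = np.random.default_rng(0)
S, LN = 4, 3
W_self = rng.normal(0, 0.1, (S, LN))
W_ngb = rng.normal(0, 0.1, (S, S + LN))

def gnn_states(neighbors, labels, eps=1e-6, max_iters=200):
    nodes = list(labels)
    x = {n: np.zeros(S) for n in nodes}
    for _ in range(max_iters):
        new_x = {}
        for n in nodes:
            acc = W_self @ labels[n]
            for u in neighbors.get(n, []):
                acc = acc + W_ngb @ np.concatenate([x[u], labels[u]])
            new_x[n] = np.tanh(acc)
        # stopping rule: ||x(t) - x(t-1)|| <= eps
        delta = max(np.linalg.norm(new_x[n] - x[n]) for n in nodes)
        x = new_x
        if delta <= eps:
            break
    return x

# Undirected triangle (a cyclic graph): the iteration settles to a fixed point
labels = {0: np.ones(3), 1: np.zeros(3), 2: np.full(3, 0.5)}
neighbors = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
states = gnn_states(neighbors, labels)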
8.2.5 Other models
Several other connectionist models belong to the framework defined by Eqs. (8.1) and (8.2). Common recurrent neural networks, which process sequences, are the simplest approach in this class. In fact, a sequence of real vectors $l_1, l_2, \ldots$ can be represented as a list of nodes, whose labels are the input vectors. Each node stands for a time instant in the computation of the network. In fact, a recurrent network is a dynamical system whose state $x_t$ at time t depends on the state and the input at time t − 1, i.e., $x_t = f_w(l_{t-1}, x_{t-1})$. Interestingly, in [37], a bi–recursive recurrent network model is proposed that processes the input sequence, firstly, in the forward direction (time t increasing) and, secondly, in the backward direction (time t decreasing). Such an approach approximates the model defined by Eqs. (8.1) and (8.2) when $f_w$ takes as input the information contained in both the previous and the next nodes of the list. On the other hand, a model alternative to RNNs, called the relational neural network, has been proposed in [38]. Relational neural networks differ from RNNs in the
transition function they use and in the parameter sharing among the nodes. In fact, the transition function has the same form as in Eq. (8.4), except that the terms $h_w(l_n, l_{(n,u)}, x_u, l_u)$ are combined by a recurrent network instead of being summed. Moreover, the parameters w are not the same for each node: different sets of parameters exist and each set is shared by a homogeneous² group of nodes. Finally, unsupervised counterparts of RNNs have also been proposed in [18]. In RAAM machines [39] and SOMs for structures [40], which can cope with directed acyclic graphs, the transition functions are a neural autoassociator and a self–organizing map, respectively. SOMs for structures have also been extended to cyclic graphs [41]. Similarly to GNNs, the cycles are processed by iterating the activation of the transition function, even if, in this case, the convergence to a fixed point cannot be formally guaranteed.
² Since relational neural networks are usually applied to relational databases, the nodes represent the table rows and are homogeneous if they belong to the same table.
8.3 Graph–based representation of images
The neural network models described in the previous section are able to process structured data. Therefore, exploiting such models for image understanding tasks (classification, localization or detection of objects, etc.) requires a preprocessing phase during which a graph representation is extracted from each image. As a matter of fact, in the last few years, graph–based representations of images have received growing interest, since they allow both symbolic and structural information to be collected in a unique “pattern”. The first step in the encoding procedure, aimed at representing images as structured data, consists in extracting a set of homogeneous regions, each one described by an appropriately chosen set of attributes. In the following, we will review several segmentation methods, commonly used to extract the set of homogeneous regions, and some graphical structures particularly suited to represent images.
8.3.1 Image segmentation
Segmentation can refer both to the process of extracting a set of regions with homogeneous features, and to the complex procedure aimed at determining the boundaries of the objects depicted in an image, better termed “object segmentation”. Therefore, in the first sense, segmenting an image means dividing it into different regions, such that each region is homogeneous w.r.t. some relevant characteristics, while the union of any pair of adjacent regions is not. A theoretical definition of segmentation, reported in [42], is:
If P(·) is a homogeneity predicate defined on groups of connected pixels, then a segmentation is a partition of the whole set of pixels F into connected subsets or regions $S_1, S_2, \ldots, S_n$, with $\bigcup_{i=1}^{n} S_i = F$ and $S_i \cap S_j = \emptyset$ $(i \neq j)$, such that the predicate $P(S_i)$, which measures the homogeneity of the set $S_i$, holds true for each region, whereas $P(S_i \cup S_j)$ is false if $S_i$ and $S_j$ are adjacent.
The segmentation phase is crucial for image analysis and pattern recognition systems, and very often it determines the quality of the final result. Nevertheless, according to [43], “the image segmentation problem is basically one of psychophysical perception, and therefore not susceptible to a purely analytical solution”. Thus, a universal theory of image segmentation does not yet exist, since all the methods known in the literature are strongly application–dependent (i.e. there are no general algorithms that can be considered effective for all types of images). Segmentation algorithms can be divided into two main categories, based on the images to be processed: monochrome or color. Color segmentation has attracted particular attention in the past few years, since color images usually provide more information than grey level images; however, color segmentation is a time–consuming process, even considering the rapid evolution of computers and their computational capabilities. The main color image segmentation methods can be classified as follows:
• Histogram thresholding: This technique, widely diffused for grey level images, can be directly extended to color images. The color space is divided w.r.t. the color components, and a threshold is considered for each component. However, since the color information is described by the tristimulus values R, G, and B, or by their linear/nonlinear transformations, representing the histogram of a color image and selecting an effective threshold is a quite challenging task [44].
• Color space clustering: The methods belonging to this class generally exploit one or more features in order to determine separate clusters in the considered color space. “Clustering of characteristic features applied to image segmentation is the multidimensional extension of the concept of thresholding” [43]. Applying the clustering approach to color images is a straightforward idea, because colors naturally tend to form clusters in the color space. The main drawback of these methods lies in appropriately determining the number of clusters in an unsupervised manner.
• Region based approaches: Region based approaches, including region growing, region splitting, region merging and their combinations, attempt to group pixels into homogeneous regions. In the region growing approach, a seed region is first selected and then expanded to include all homogeneous neighbors. Region growing is strictly dependent on the choice of the seed region and on the order in which pixels are examined. On the contrary, in the region splitting approach, the initial seed region is the whole image. If the seed region is not homogeneous, it is divided into four squared subregions, which become the new seed regions. This process is carried out until the obtained (squared) regions are homogeneous. The
region merging approach is often combined with region growing and splitting with the aim of obtaining homogeneous regions as large as possible.
• Edge detection: In monochrome image segmentation, an edge is defined as a discontinuity in the grey level, and can be detected only when there is a sharp boundary in brightness between two regions. However, in color images, the information about edges is much richer than in the monochrome case. For example, an edge between two objects with the same brightness but different hue can easily be detected [45]. Edge detection in color images can thus be performed by defining a discontinuity in a three–dimensional color space. A fundamental drawback of these methods is their sensitivity to noise.
Apart from those listed above, a variety of segmentation methods have been proposed in the literature, based on fuzzy techniques, physics approaches, and neural networks. Fuzzy techniques exploit fuzzy logic in order to model uncertainty. For instance, if fuzzy theory is used in combination with a clustering method, a score can be assigned to each pixel, representing its “degree of membership” w.r.t. each region. Physics approaches aim at solving the segmentation problem by employing physical models to locate the objects’ boundaries, while eliminating the spurious edges due to shadows or highlights. Finally, neural network approaches exploit a wide variety of network architectures (Hopfield neural networks, Self Organizing Maps, MLPs). In general, unsupervised approaches are preferable for this task, since providing the target class for each pixel that belongs to an image is very expensive.
After the segmentation process, a graph that represents the arrangement of the obtained regions can be extracted. In such a graph, the geometrical and visual properties of each region are collected in the label of the related node. The edges, which link nodes of the structure, are instead exploited to describe the topological arrangement of the extracted regions. The graph can be directed or undirected; moreover, the presence of an edge can represent adjacency or some hierarchical relationship. In the following, three kinds of structures, particularly suited to represent images, will be described: region adjacency graphs with labelled edges, forests of trees, and multi–resolution trees.
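Before turning to the graph structures, the following minimal sketch illustrates the region growing step described above: a single region is grown on a grey-level image from a seed pixel, absorbing 4-connected neighbors while a simple homogeneity predicate holds. The predicate (intensity within a tolerance of the running region mean) and the tolerance value are illustrative assumptions, not taken from the chapter.

# Minimal region-growing sketch; homogeneity predicate and tolerance are assumed.
import numpy as np

def grow_region(img, seed, tol=10.0):
    h, w = img.shape
    region = np.zeros((h, w), dtype=bool)
    region[seed] = True
    stack = [seed]
    total, count = float(img[seed]), 1        # running sum/count for the region mean
    while stack:
        r, c = stack.pop()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):   # 4-connectivity
            nr, nc = r + dr, c + dc
            if 0 <= nr < h and 0 <= nc < w and not region[nr, nc]:
                if abs(float(img[nr, nc]) - total / count) <= tol:
                    region[nr, nc] = True
                    total += float(img[nr, nc])
                    count += 1
                    stack.append((nr, nc))
    return region

img = np.array([[10, 12, 50], [11, 13, 52], [90, 91, 51]], dtype=float)
mask = grow_region(img, (0, 0))     # grows over the homogeneous top-left area only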
8.3.2 Region Adjacency Graphs
The segmentation method yields a set of regions, each region being described by a vector of real-valued features. Moreover, the structural information related to the spatial relationships between pairs of regions can be coded by an undirected graph. Two connected regions $R_1$, $R_2$ are adjacent if, for each pixel $a \in R_1$ and $b \in R_2$, there exists a path between a and b entirely lying in $R_1 \cup R_2$. The Region Adjacency Graph (RAG) is extracted from the segmented image by (see Fig. 8.1):
1. Associating a node to each region; the real vector of features represents the node label;
2. Linking the nodes associated to adjacent regions with undirected edges.
A RAG takes into account both the topological arrangement of the regions and the symbolic visual information. Moreover, the RAG connectivity is invariant under translations and rotations (while labels are not), which is a useful property for a high–level representation of images. The information collected in each RAG can be further enriched by associating to each undirected edge a real vector of features (an edge label), which describes the mutual position of the regions associated to the linked nodes. This kind of structure is defined as a Region Adjacency Graph with Labelled Edges (RAG–LE). For instance, given a pair of adjacent regions i and j, the label of the edge (i, j) can be defined as the vector [D, A, B, C] (see Fig. 8.4), where:
• D represents the distance between the two barycenters;
• A measures the angle between the two principal inertial axes;
• B is the angle between the principal inertial axis of i and the line connecting the barycenters;
• C is the angle between the principal inertial axis of j and the line connecting the barycenters.
Fig. 8.4 Features stored into the label of each edge. The features describe the relative position of two adjacent regions.
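A RAG–LE of the kind just described can be extracted from a segmentation label map along the following lines; in this sketch each node is labelled with its area and barycenter, and each edge only with the barycenter distance D, while the angular features A, B and C of Fig. 8.4 would be computed analogously. The specific feature set and the 4-adjacency test are illustrative assumptions.

# Sketch of RAG-LE extraction from a label map; feature choices are illustrative.
import numpy as np
from itertools import product

def build_rag_le(label_map):
    nodes, edges = {}, {}
    for r in np.unique(label_map):
        ys, xs = np.nonzero(label_map == r)
        nodes[int(r)] = {"area": len(ys), "barycenter": (ys.mean(), xs.mean())}
    h, w = label_map.shape
    for y, x in product(range(h), range(w)):
        for dy, dx in ((0, 1), (1, 0)):                     # 4-adjacency
            ny, nx = y + dy, x + dx
            if ny < h and nx < w and label_map[y, x] != label_map[ny, nx]:
                i, j = sorted((int(label_map[y, x]), int(label_map[ny, nx])))
                if (i, j) not in edges:
                    bi = np.array(nodes[i]["barycenter"])
                    bj = np.array(nodes[j]["barycenter"])
                    edges[(i, j)] = {"D": float(np.linalg.norm(bi - bj))}
    return nodes, edges

label_map = np.array([[0, 0, 1], [0, 2, 1], [2, 2, 1]])
nodes, edges = build_rag_le(label_map)    # edges between regions 0-1, 0-2 and 1-2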
GNNs, which can cope with generic kinds of graphs, are used on RAGs–LE, without any preprocessing phase. Actually, examples of applications of GNNs to
image classification and object localization in images can be found in [46, 47].

8.3.3 From RAGs–LE to forests of trees
In order to cope with RAGs, the RNN model described in the previous section presupposes that each RAG–LE must be transformed into a directed acyclic graph. Such a transformation takes a RAG–LE R, along with a selected node n, as input, and produces a tree T with root n. It can be proved that the trees built from R contain the same information as R [15]. Actually, R is unfolded into T by the following algorithm:
1. Insert a copy of n in T;
2. Visit R starting from n, using a breadth–first strategy; for each visited node v, insert a copy of v into T, linking v to its parents and preserving the information attached to each edge;
3. Repeat step 2 until a predefined stop criterion is satisfied, and in any case until all the arcs have been visited in both directions (at least) once.
At the end of the algorithm, the target associated to n in R is attached to the root node of T (see Fig. 8.5). According to [15], the above unfolding strategy produces a recursive–equivalent tree, which holds the same information contained in R. Depending on the stop criterion, if the breadth–first visit is halted when all the arcs have been visited once, the minimal recursive–equivalent tree is obtained (minimal unfolding, Fig. 8.5a). A different approach consists in visiting each edge and repeating step 2 until a probabilistic variable x becomes true (probabilistic unfolding, Fig. 8.5b). Finally, the breadth–first visit can be substituted with a random visit of the graph (random unfolding, Fig. 8.5c). In any case, each edge must be visited at least once in both directions in order to guarantee the recursive equivalence between R and T. It is worth noting that the above unfolding procedure is useful only when one output for each graph has to be produced. Such a situation arises in those applications, as in image classification, where the goal is to predict a class or a value associated to the whole graph. However, in other applications the predicted values are related to properties of the single nodes. For example, in object localization the property to be predicted is whether a certain region represents a part of an object or not. In those cases, the unfolding procedure must be repeated for each node of the input, and the result is a forest of trees [15].
Remark: For large graphs, the unfolding strategy is computationally expensive. In the context of object localization, the learning set complexity can be reduced by selecting only a subset of the nodes, which will constitute the roots of the forest of trees. Let Vo be the set of nodes which correspond to the parts of the object to be localized. The learning set should contain all the nodes in Vo and a random set of nodes not representing the object, with a cardinality equal to |Vo|, in order to have a balanced training set. On the other hand, during the test phase, the RAGs–LE must be unfolded starting from each node. This limitation can be overcome by representing the whole image with a unique graph (see Fig. 8.6), holding the same information as the forest of trees. Such a graph can be obtained by unfolding all the trees at the same time, and it is composed of layers. Each layer contains a copy of all the nodes of the original RAG–LE, whereas the links between layers are represented by the arcs starting from each node. Finally, each node belonging to the last layer has an associated target, needed for the training process
Fig. 8.5 A graph and three different unfolding trees
of the RNN. Actually, this method represents an extension of the dataset merging proposed in [6], since the graph can be interpreted as a particular merging of the forest of trees. Finally, it can be noticed that the graph contains dl edges, where d is the depth of the unfolding and l is the number of edges in the RAG–LE. It can be easily proved that both the time and the space complexity of the unfolding depend linearly on the dimension of the graph, i.e. the RAG–LE unfolding requires O(dl) operations (and memory locations). The representation based on the unfolding trees and the RNN model has been used for image classification in [11, 46]. The approach has been later extended to face localization [10, 30] and to (general) object localization [10, 29], by the adoption of the unfolding forest concept. It is worth noting that the introduction of the GNN model, which can process cyclic graphs and produces an output for each node, has provided an alternative approach for object localization that does not require the explicit construction of the unfolding forest. Interestingly, however, the two approaches are almost equivalent in terms of their computational and learning capabilities.
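The minimal unfolding described above can be sketched as follows: each undirected edge is treated as a pair of opposite arcs, and the graph is visited breadth-first from the chosen root until every arc has been traversed once. The nested-dictionary tree representation, and the omission of node and edge labels, are illustrative simplifications.

# Sketch of the minimal unfolding of a (RAG-like) graph into a tree.
from collections import deque

def minimal_unfolding(neighbors, root):
    visited_arcs = set()
    tree = {"node": root, "children": []}
    queue = deque([tree])
    while queue:
        t = queue.popleft()
        v = t["node"]
        for u in neighbors[v]:
            if (v, u) not in visited_arcs:       # expand each arc exactly once
                visited_arcs.add((v, u))
                child = {"node": u, "children": []}
                t["children"].append(child)
                queue.append(child)
    return tree

# Triangle graph unfolded from node 0: all six arcs end up being visited
neighbors = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
t = minimal_unfolding(neighbors, 0)

Repeating the call for every node of the graph would yield the forest of trees used for node-level tasks such as object localization.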
8.3.4 Multi–resolution trees
Multi–resolution trees (MRTs) are hierarchical data structures which are generated during the segmentation process, like, for instance, quad–trees [48]. While quad–trees can be used to represent a region splitting process, MRTs are used to describe
Fig. 8.6 The graph obtained by merging the unfolding trees of a forest
the region growing phase of the segmentation. Some different hierarchical structures, like monotonic trees [49] or contour trees [50, 51, 52], can be exploited to describe the set of homogeneous regions obtained at the end of the segmentation process, in which the links between nodes represent the inclusion relationship established among region boundaries. On the contrary, MRTs epitomize both the result of the segmentation and the sequence of steps which produces the final set of regions. An MRT is built by performing the following steps (see Fig. 8.7):
• Each region obtained at the end of a color clustering phase is associated to a leaf of the tree;
• During the region growing phase, when two regions are merged together, a parent node is added to the tree, connected to both the nodes corresponding to the merged regions;
• At the end of the region growing step, a virtual node is added as the root of the tree. The nodes corresponding to the set of regions obtained at the end of the segmentation process become the children of the root node.
To each node of the MRT, except for the root, a real vector label is attached, which describes the geometrical and visual properties of the associated region. Moreover, each edge can also be labelled by a vector which collects information regarding the merging process. Considering a pair of nodes joined by an edge, the region
Fig. 8.7 Multi–resolution tree generation: white nodes represent vertexes added to the structure when a pair of similar regions are merged together.
ciated to the child node is completely contained in the region associated to the parent node, and it is useful to associate some features to the edge in order to describe how the child contribute to the creation of its parent. For instance, some informative features can be the color distance between the two regions, the distance between their barycenters, and the ratio of their respective areas (child w.r.t. parent). Finally, notice that, MRTs do not describe directly the topological arrangement of the regions, that however can be inferred considering both the geometrical features associated to each node (for instance, the coordinates of the bounding box of each region can be stored in the node label) and the MRT structure. The MRT representation has been used together with the RNN model for image classification. The results prove that RNNs tend to obtain a better performance when MRTs are employed in place of the unfolding trees described in the previous section [53]. Finally, it is worth mentioning
that other image representation approaches can be suited for particular kinds of images. For example, contour trees are useful to represent artificial images. Actually, RNNs have been applied along with contour trees to logo classification [25].
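The construction of an MRT from a sequence of merge events can be sketched as follows: the leaves are the initial clusters, each merge adds a parent node over the two merged regions, and a virtual root finally collects the surviving regions. The stored attributes (only the area on the nodes and the child/parent area ratio on the edges) are illustrative; the richer node and edge labels described above would be added in the same way.

# Sketch of MRT construction from merge events; attributes are illustrative.
def build_mrt(initial_areas, merges):
    """initial_areas: region id -> area; merges: list of (id_a, id_b, new_id)."""
    nodes = {r: {"area": a, "children": []} for r, a in initial_areas.items()}
    alive = set(initial_areas)
    for a, b, new in merges:
        parent_area = nodes[a]["area"] + nodes[b]["area"]
        nodes[new] = {
            "area": parent_area,
            # edge label: ratio of the child area to the parent area
            "children": [(a, nodes[a]["area"] / parent_area),
                         (b, nodes[b]["area"] / parent_area)],
        }
        alive -= {a, b}
        alive.add(new)
    # virtual root over the regions surviving at the end of region growing
    nodes["root"] = {"area": None, "children": [(r, None) for r in alive]}
    return nodes

mrt = build_mrt({"r1": 10, "r2": 5, "r3": 20},
                [("r1", "r2", "r12")])    # r12 and r3 become children of the root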
8.4 Conclusions
In this chapter, a brief survey of connectionist models recently developed to process graphs has been presented, showing how they represent a powerful tool to address all those problems where the information is naturally organized in entities and relationships among entities. Image understanding tasks usually belong to this category, since codifying images by trees or graphs gives rise to a more robust and informative representation, aimed at facilitating object detection and image classification. Without pretending to be exhaustive, the intent of this chapter has been that of proposing some different ways to represent images by trees/graphs and, correspondingly, neural network models able to process such structures, based on the particular image processing task to be faced.
References
1. Gori, M., Maggini, M., Sarti, L.: Exact and approximate graph matching using random walks. IEEE Transactions on Pattern Analysis and Machine Intelligence 27(7) 1100–1111 (2005)
2. Boser, B., Guyon, I., Vapnik, V.: A training algorithm for optimal margin classifiers. In Haussler, D., ed.: Proceedings of the 5th Annual ACM Workshop on Computational Learning Theory, ACM Press 144–152 (1992)
3. Vapnik, V.: The Nature of Statistical Learning Theory. Springer–Verlag (1995)
4. Gärtner, T.: A survey of kernels for structured data. SIGKDD Explorations 5(1) 49–58 (2003)
5. Sperduti, A., Starita, A.: Supervised neural networks for the classification of structures. IEEE Transactions on Neural Networks 8(3) 714–735 (1997)
6. Frasconi, P., Gori, M., Sperduti, A.: A general framework for adaptive processing of data structures. IEEE Transactions on Neural Networks 9(5) 768–786 (1998)
7. Küchler, A., Goller, C.: Inductive learning in symbolic domains using structure–driven recurrent neural networks. In Görz, G., Hölldobler, S., eds.: Advances in Artificial Intelligence. Springer, Berlin 183–197 (1996)
8. Elman, J.: Finding structure in time. Cognitive Science 14 179–211 (1990)
9. Bianchini, M., Gori, M., Scarselli, F.: Theoretical properties of recursive networks with linear neurons. IEEE Transactions on Neural Networks 12(5) 953–967 (2001)
10. Bianchini, M., Maggini, M., Sarti, L., Scarselli, F.: Recursive neural networks for processing graphs with labelled edges: Theory and applications. Neural Networks – Special Issue on Neural Networks and Kernel Methods for Structured Domains 18(8) 1040–1050 (2005)
11. Gori, M., Maggini, M., Sarti, L.: A recursive neural network model for processing directed acyclic graphs with labeled edges. In: Proceedings of the IEEE International Joint Conference on Neural Networks (IJCNN 2003). 1351–1355 (2003)
12. Bianchini, M., Maggini, M., Sarti, L., Scarselli, F.: Recursive neural networks for processing graphs with labelled edges. In: Proceedings of ESANN 2004, Bruges (Belgium) 325–330 (2004)
13. Bianucci, A.M., Micheli, A., Sperduti, A., Starita, A.: Analysis of the internal representations developed by neural networks for structures applied to quantitative structure-activity relationship studies of benzodiazepines. Journal of Chemical Information and Computer Sciences 41(1) 202–218 (2001)
14. Bianchini, M., Gori, M., Scarselli, F.: Recursive processing of cyclic graphs. In: Proceedings of the IEEE International Joint Conference on Neural Networks (IJCNN 2002) 154–159 (2002)
15. Bianchini, M., Gori, M., Sarti, L., Scarselli, F.: Recursive processing of cyclic structures. IEEE Transactions on Neural Networks 17(1) 10–18 (2006)
16. Scarselli, F., Gori, M., Tsoi, A.C., Hagenbuchner, M., Monfardini, G.: The graph neural network model. IEEE Transactions on Neural Networks 20(1) 61–80 (2009)
17. Gori, M., Monfardini, G., Scarselli, F.: A new model for learning in graph domains. In: Proceedings of the IEEE International Joint Conference on Neural Networks (IJCNN 2005) (2005)
18. Hammer, B., Micheli, A., Strickert, M., Sperduti, A.: A general framework for unsupervised processing of structured data. Neurocomputing 57 3–35 (2004)
19. Günter, S., Bunke, H.: Validation indices for graph clustering. In Jolion, J.M., Kropatsch, W., Vento, M., eds.: Proceedings of the third IAPR–TC15 Workshop on Graph–based Representations in Pattern Recognition. 229–238 (2001)
20. Kohonen, T., Sommervuo, P.: How to make large self–organizing maps for nonvectorial data. Neural Networks 15(8–9) 945–952 (2002)
21. Chappell, G., Taylor, J.: The temporal Kohonen map. Neural Networks 6 441–445 (1993)
22. Koskela, T., Varsta, M., Heikkonen, J., Kaski, K.: Recurrent SOM with local linear models in time series prediction. In Verleysen, M., ed.: Proceedings of the 6th European Symposium on Artificial Neural Networks (ESANN 1998). 167–172 (1998)
23. Koskela, T., Varsta, M., Heikkonen, J., Kaski, K.: Time series prediction using recurrent SOM with local linear models. In: Proceedings of the Int. J. Conf. of Knowledge–Based Intelligent Engineering Systems. 2(1). 60–68 (1998)
24. Farkas, I., Mikkulainen, R.: Modeling the self–organization of directional selectivity in the primary visual cortex. In: Proceedings of the International Conference on Artificial Neural Networks, Springer 251–256 (1999)
25. Diligenti, M., Gori, M., Maggini, M., Martinelli, E.: Adaptive graphical pattern recognition for the classification of company logos. Pattern Recognition 34 2049–2061 (2001)
26. de Mauro, C., Diligenti, M., Gori, M., Maggini, M.: Similarity learning for graph based image representation. Special issue of Pattern Recognition Letters 24(8) 1115–1122 (2003)
27. Bianchini, M., Mazzoni, P., Sarti, L., Scarselli, F.: Face spotting in color images using recursive neural networks. In Gori, M., Marinai, S., eds.: IAPR – TC3 International Workshop on Artificial Neural Networks in Pattern Recognition, Florence (Italy) (2003)
28. Bianchini, M., Gori, M., Mazzoni, P., Sarti, L., Scarselli, F.: Face localization with recursive neural networks. In Marinaro, M., Tagliaferri, R., eds.: Neural Nets — WIRN ’03. Springer, Vietri (Salerno, Italy) (2003)
29. Bianchini, M., Maggini, M., Sarti, L., Scarselli, F.: Recursive neural networks for object detection. In: Proceedings of the IEEE International Joint Conference on Neural Networks (IJCNN 2004). 3 1911–1915 (2004)
30. Bianchini, M., Maggini, M., Sarti, L., Scarselli, F.: Recursive neural networks learn to localize faces. Pattern Recognition Letters 26(12) 1885–1895 (2005)
31. Scarselli, F., Gori, M., Tsoi, A.C., Hagenbuchner, M., Monfardini, G.: Computational capabilities of graph neural networks. IEEE Transactions on Neural Networks 20(1) 81–109 (2009)
32. Haykin, S.: Neural Networks: A Comprehensive Foundation. Prentice Hall, New York (1994)
33. McClelland, J., Rumelhart, D.E.: Parallel Distributed Processing: Explorations in the Microstructure of Cognition. Volume 2. MIT Press, Cambridge (1986)
34. Khamsi, M.A.: An Introduction to Metric Spaces and Fixed Point Theory. John Wiley & Sons Inc (2001)
35. Almeida, L.: A learning rule for asynchronous perceptrons with feedback in a combinatorial environment. In Caudill, M., Butler, C., eds.: IEEE International Conference on Neural Networks. 609–618 (1987)
36. Pineda, F.: Generalization of back-propagation to recurrent neural networks. Physical Review Letters 59 2229–2232 (1987)
37. Vullo, A., Frasconi, P.: Disulfide connectivity prediction using recursive neural networks and evolutionary information. Bioinformatics 20 653–659 (2004)
38. Blockeel, H., Uwents, W.: Using neural networks for relational learning. In: SRL2004, ICML 2004 Workshop on Statistical Relational Learning and its Connections to Other Fields (2004)
39. Sperduti, A.: Labelling RAAM. Connection Science 6(4) 429–459 (1994)
40. Hagenbuchner, M., Sperduti, A., Tsoi, A.C.: A self-organizing map for adaptive processing of structured data. IEEE Transactions on Neural Networks 14(3) 491–505 (2003)
41. Hagenbuchner, M., Sperduti, A., Tsoi, A.C.: Contextual processing of graphs using self-organizing maps. In: Proceedings of the European Symposium on Artificial Neural Networks (ESANN 2005). 399–404 (2005)
42. Pal, N.R., Pal, S.K.: A review on image segmentation techniques. Pattern Recognition 26(9) 1277–1294 (1993)
43. Fu, K., Mui, J.K.: A survey on image segmentation. Pattern Recognition 13 3–16 (1981)
44. Haralick, R., Shapiro, L.: Image segmentation techniques. Computer Vision, Graphics and Image Processing 29 100–132 (1985)
45. Macaire, L., Ultre, V., Postaire, J.: Determination of compatibility coefficients for color edge detection by relaxation. In: Proceedings of the International Conference on Image Processing. 1045–1048 (1996)
46. Di Massa, V., Monfardini, G., Sarti, L., Scarselli, F., Maggini, M., Gori, M.: A comparison between recursive neural networks and graph neural networks. In: Proceedings of the International Joint Conference on Neural Networks (2006)
47. Monfardini, G., Di Massa, V., Scarselli, F., Gori, M.: Graph neural networks for object localization. In: Proceedings of the 17th European Conference on Artificial Intelligence (2006)
48. Hunter, G.M., Steiglitz, K.: Operations on images using quadtrees. IEEE Transactions on Pattern Analysis and Machine Intelligence 1(2) 145–153 (1979)
49. Song, Y., Zhang, A.: Monotonic tree. In: Proceedings of the 10th International Conference on Discrete Geometry for Computer Imagery, Bordeaux, France (2002)
50. Morse, S.: Concepts of use in computer map processing. Communications of the ACM 12(3) 147–152 (1969)
51. Roubal, J., Peucker, T.: Automated contour labeling and the contour tree. In: Proceedings of AUTO-CARTO 7. 472–481 (1985)
52. van Kreveld, M., van Oostrum, R., Bajaj, C., Pascucci, V., Schikore, D.: Contour trees and small seed sets for iso-surface traversal. In: Proceedings of the 13th Annual Symposium on Computational Geometry. 212–220 (1997)
53. Bianchini, M., Maggini, M., Sarti, L.: Object recognition using multiresolution trees. In: Proceedings of the Joint IAPR International Workshops, SSPR 2006 and SPR 2006, Hong Kong, China. 331–339 (2006)
Index
3D Face Model, 73
3D Tracking algorithm, 90
aggregation functions, 142
Backus-Naur, 170
Bayesian Decision theory, 142
Bayesian network, 143
Blob Detection, 27
CIE-Lab, 25
Color Flatting, 26
Color Spaces, 20
Decision Trees, 142
Dempster and Shafer theory, 142
Edge Detection, 30
Face Detection, 43, 52
Face Identification, 141
fuzzy decision tree, 148
GOP Structure, 5
Graph Processing, 183
GSM/3G, 137
H.264, 8
HCI, 174
HSL, 23
HSV, 23
Illumination, 87
Image Representation, 189
Image Segmentation, 189
Interaction Metaphors, 121
Interaction Techniques, 121, 126
keyboard navigation system, 168
L-Exos system, 168
Magic Metaphors, 122
Model Fitting, 94
MPEG-2, 8
Neural Networks, 183
Object Detection, 53
Pattern Recognition, 179
PCA, 81, 82
RBAC (Role Based Access Control), 137
Realistic Metaphors, 122
Recursive Neural Networks, 186
Shape Alignment, 34
Shape analysis, 33
Shape Encoding, 35
Smart Home, 169
smartphones, 167
Transcoding, 3
Virtual Environments, 116
Virtual Hand, 122
Virtual Therapy Room, 175
VR, 174
Wiimote, 124